Recently there have been nationwide protests, triggered by events in Ferguson, MO, that have challenged common police practices. The police argue that individuals and neighborhoods are targeted because certain “types” of people are more likely to engage in criminal activity. This type of profiling, and the tactics used by police, have come under intense public scrutiny. Sophisticated statistical analyses (Gelman et al. 2007) report that in New York City racial minorities are stopped disproportionately by police, even after accounting for local variation in crime rates and for how likely members of different groups are to be caught committing crimes. The central question in the statistical debate is: are the police acting based upon data, or upon racial/cultural bias?
Many cities have started to publish crime statistics. These allow citizens to see limited information about each crime recorded by the city’s police department, typically the location, date, and type of crime. The public discourse around police use of profiling (and physically aggressive tactics) often lacks nuance. On the one hand, it seems reasonable to expect police to use data to inform their strategies and tactics; on the other hand, it seems entirely unjust to target individuals and places (with force that may be excessive) simply because of their racial and/or economic characteristics.
In this lab you will tackle the following questions:
In this lab you will be using crime data provided by the city of Chicago and Census-tract-level data from the American Community Survey. Download the data here. A few words of caution: multicollinearity may be a major and unavoidable problem in your analysis. This makes it difficult to directly compare model coefficients. A conservative approach to multicollinearity is to structure models in such a way that you never have to directly interpret coefficients, for example by only talking about the model in the aggregate (“the \(R^2\) for model 2 is better than the one for model 1”). However, this is unsatisfying and difficult to communicate to policy makers. An alternative, slightly less conservative approach is to control multicollinearity as much as possible and discuss the significance and sign of individual coefficients (but not their magnitudes). How you present your models is a matter of personal judgement.
The data we will be working with is inherently spatial: the tract-level data from both the city and the Census describes places. Thus maps are a convenient way to visualize both the inputs and outputs of our analysis. Visualizing spatial data is fairly easy in R; however, making a proper map (with marginalia) is really time-consuming. We will be “mapping” things in R, but this term is used quite loosely; the basic maps made in this lab would make most properly trained cartographers squirm and avert their eyes!
We have to load a few libraries for handling and visualizing spatial data.
library(maptools) # Guess what this package is for? For creating maps :)
# Load a map and data for Chicago
# These are stored in the "shapefile" format
chicago = readShapePoly("Data/ChicagoCrime.shp")
The chicago object you just created is a SpatialDataFrame, a type of R object that contains spatial data (points, lines, or polygons) along with the attributes of each element in the data. Working with spatial data frames is slightly different from working with regular data.frames; spatial data.frames have a lot of different pieces. You can see these pieces by typing head(chicago), but this gives reams of hard-to-digest output. SpatialDataFrames are organized using a series of ‘slots’ that hold data, geometry, projection, and other information necessary for drawing maps. You can see the names of these slots by typing slotNames(chicago).
SpatialDataFrames have special behavior: if you use the plot() function, they draw a map. This map only shows the outlines of the polygons (or points/lines) that contain data. Visualizing data on the map requires a few more steps (see below).
plot(chicago)
If you want to see only the data you have to access the ‘data’ slot. This is done using the @ symbol and typing the name of the slot you want to access. For example, chicago@data provides access to the data.frame describing all census tracts in Chicago. You can work with chicago@data just like you would work with a regular data.frame (Note: you might want to create a separate chdata data.frame object by doing chdata = chicago@data later).
One of the things you’ll immediately notice about the Chicago data is that the column names are all truncated to 10 characters. This makes it very hard to understand what each variable stores. This is a legacy of the shapefile format, which (stupidly) is incapable of storing column names of more than 10 characters. We’ll rename the columns so it’s easier to tell what we are working with. You can print the list of variable names for any data frame by typing names(df_name).
# Read in a readable list of variable names (this was created by hand).
# Note the use of ":" as a column delimiter...
# If names are too verbose for your liking edit them in the text file prior to importing
var_names = read.table("Data/Variable Descriptions.txt", sep=":", strip.white=TRUE)
# Attach the modified names to the chicago data
names(chicago@data) = as.character(var_names[, 2])
Drawing a map in R is way more complicated than it needs to be. This lab presents a simple method for making thematic maps; because it is simple, this method doesn’t allow nuanced control of visual variables. The most robust methods for creating maps involve the package lattice and the function spplot(), or the ggplot library and the fortify() function (we’ll learn more about ggplot in future labs).
(Note: You may encounter some trouble using the fortify() function. If running fortify() returns a permission error message, you’ll need to install and load an additional package, rgeos, to get it working properly. Ask your TA if you need assistance with this.)
The process for making simple maps in R is as follows:
For example, income could be divided into categories for $0-$10,000; $10,000-$20,000; $20,000-$30,000; and so on. To do this, use the function cut() or classIntervals(); in this lab we will use classIntervals(). Colors for the classes come from the RColorBrewer library (which is based on the famous ColorBrewer website).
library(classInt) # Library for creating class intervals
# Create Categories based on quantiles
cats7 = classIntervals(chicago$Number_of_violent_crimes, n=7, style="quantile")
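(A side note on what style="quantile" is doing: it simply places the class breaks at evenly spaced quantiles of the data, so each class holds roughly the same number of tracts. A minimal sketch using only base R; the vector x below is made-up stand-in data, not the Chicago counts.)

```r
# Sketch of what a quantile classification does, using base R only.
# 'x' is hypothetical stand-in data, not chicago$Number_of_violent_crimes.
x = c(0, 2, 5, 7, 11, 13, 20, 25, 40, 60, 80, 100, 120, 163)
# 7 classes need 8 break points: the 0/7, 1/7, ..., 7/7 quantiles
breaks = quantile(x, probs = seq(0, 1, length.out = 8))
classes = cut(x, breaks = breaks, include.lowest = TRUE)
table(classes)  # each class ends up with roughly the same number of values
```

This is why quantile maps always look “balanced”: every color bin is used about equally often, regardless of how skewed the underlying counts are.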
Now we have to choose a color to use to describe the categories.
library(RColorBrewer)
# Display all palettes
# display.brewer.all()
# Lets use this one
# display.brewer.pal(7, "YlGnBu")
# Save palette as an object
pal7 = brewer.pal(7, "YlGnBu")
The last step is to attach the palette to the categories and draw the map.
seven_cols = findColours(cats7, pal7)
# Draw map using specificed data and colors
# lty=0 (line type) turns off tract borders
plot(chicago, col=seven_cols, lty=0)
# Normalize the crime counts by dividing by the area of each tract
cats7_perarea = classIntervals(chicago$Number_of_violent_crimes / chicago$Area_of_Tract_ , n=7, style="quantile")
seven_cols = findColours(cats7_perarea, pal7)
plot(chicago, col=seven_cols, lty=0)
# Normalize the crime counts by dividing by the population of each tract ((chicago$Population_Density * chicago$Area_of_Tract_))
cats7_perarea = classIntervals(chicago$Number_of_violent_crimes / (chicago$Population_Density * chicago$Area_of_Tract_), n=7, style="quantile")
seven_cols = findColours(cats7_perarea, pal7)
plot(chicago, col=seven_cols, lty=0)
"
Desity of crime is a more reliable representation of crime in the map because
a tract with high/low number of crimes cannot be interpreted as a hot/cold spot crime in the map or vice versa. The crime number should be devided by area (or more appropriately by population) to be better representative of crime pattern in the map.
"Number_of_violent_crimes. What do you notice? Are there any outliers?
boxplot(chicago@data$Number_of_violent_crimes)
"Based on the box plot of Number_of_violent_crimes, it seems that there are some
outliers in the data"
"These outliers can have several reasons:
- The population of a those tracts are much higher than the other tracts and so it make sense to observe more crime
- Maybe there is a racial/cultural bias in the neigborhood for which the crime is reported
- Maybe that is not an actual outlier and the crime rate is really high in a neighborhood
Note: I think outlier is not a ggod term here because the probably the extreme values are not wrong data, and they just should be normalized to be interpreted"
Create a new variable called adj_violent_crimes that ‘adjusts’ for potential outliers.
chicago$adj_violent_crimes = sqrt(chicago$Number_of_violent_crimes)
Explore other transformations of chicago$adj_violent_crimes. (It’s fine to keep the square root transform, but justify the choice if you do.)
par(mfrow=c(3,2))
# SQRT
hist(sqrt(chicago$Number_of_violent_crimes), col="gray")
boxplot(sqrt(chicago$Number_of_violent_crimes),
col="gray",
ylab = "sqrt(Number of violent crimes)",
main = "Box plot of sqrt(chicago$Number_of_violent_crimes)")
# LOG
hist(log(chicago$Number_of_violent_crimes), col="gray")
boxplot(log(chicago$Number_of_violent_crimes),
col="gray",
ylab = "log(Number of violent crimes)",
main = "Box plot of log(chicago$Number_of_violent_crimes)")
## Warning in bplt(at[i], wid = width[i], stats = z$stats[, i], out = z$out[z
## $group == : Outlier (-Inf) in boxplot 1 is not drawn
# EXP
hist(exp(chicago$Number_of_violent_crimes), col="gray")
boxplot(exp(chicago$Number_of_violent_crimes),
col="gray",
ylab = "exp(Number of violent crimes)",
main = "Box plot of exp(chicago$Number_of_violent_crimes)")
Which transformation works best (sqrt, log, or exp)?
"
I chose SQRT because the distribution of the transformed data is nearly normal, and the data is evenly distributed in the box plot as well. The distribution of LOG is also close to normal, but it is a bit skewed, and it seems that SQRT works a bit better for this data.
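(One hedged justification for preferring the square root: crime counts are roughly Poisson-like, and the square root is the classic variance-stabilizing transform for Poisson data, making the spread of sqrt(count) about 0.5 regardless of the mean. A quick simulation sketch with made-up counts, not the lab data:)

```r
# Sketch: sqrt() stabilizes the variance of Poisson-like counts.
# Simulated counts only -- not the Chicago crime data.
set.seed(42)
low  = rpois(10000, lambda = 10)   # e.g., low-crime tracts
high = rpois(10000, lambda = 100)  # e.g., high-crime tracts
c(sd(low), sd(high))               # raw counts: sd grows with the mean
c(sd(sqrt(low)), sd(sqrt(high)))   # after sqrt: both sds close to 0.5
```

Note also that log() fails outright for tracts with zero crimes (log(0) = -Inf), which is exactly the warning the boxplot above produced.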
"Now we want to fit models to predict the association between neighborhood level socioeconomic characteristics and crime. This is an academic exercise but it reflects many real world situations, where one is faced with a large set of potential models to answer a somewhat vague policy/empirical question.
For example, the model below (which has an adjusted R-square of about 0.5), has a selection of variables that might be associated with the prevalence of violent crime:
mod1 = lm(adj_violent_crimes ~ Population_Density +
Percent_of_children_in_single_female_headed_household +
Percent_of_households_with_no_car +
Percent_Less_than_High_School_Education_ +
Percent_with_a_Graduate_Degree_ +
Percent_Black_Alone +
Median_House_Value +
Median_Rent_ +
Income_Gini_Coefficient_,
data=chicago@data)
Produce a histogram and scatterplot of the above model’s residuals.
hist(resid(mod1))
plot(resid(mod1) ~ predict(mod1))
abline(0, 0) # the horizon
Are there any major issues that we should be concerned about?
" No, Here is the checklist:
heteroskedasticity (non-constant variance): The residuals roughly form a horizontal band around the 0 line. This suggests that the variances of the error terms are equal.
influential observations (possible outliers): there are a few outlires, but their value is not very large and cannot be considered a major issue. But in general non of them do stands out from the basic random pattern of residuals.
pattern: There is not any pattern in the plot and The residuals bounce randomly around the 0 line. This suggests that the assumption that the relationship is linear is reasonable.
"# Jenks categories of residuals
cats5 = classIntervals(resid(mod1), n=5, style="jenks")
# Five class color palette
pal5 = brewer.pal(5, "YlOrRd")
# Attach colors to categories
five_cols = findColours(cats5, pal5)
# Plot map
plot(chicago, col=five_cols, lty=0)
"Yes, the residuals seem spatially random, and no pattern can be seen in the residual map"
Now map the residuals of a minimalist model, adj_violent_crimes ~ Population_Density:
# Basic model
m = lm(adj_violent_crimes ~ Population_Density, data=chicago@data)
# Jenks categories of residuals
cats5 = classIntervals(resid(m), n=5, style="jenks")
# Five class color palette
pal5 = brewer.pal(5, "YlOrRd")
# Attach colors to categories
five_cols = findColours(cats5, pal5)
# Plot map
plot(chicago, col=five_cols, lty=0)
When the residuals from a model show clear spatial patterns, there is evidence of some sort of missing variable; this is sometimes called ‘model misspecification’. The patterns in the model residuals suggest that some variable or variables missing from the model are strongly related to the outcome of interest. For example, the plot shown below plots the residuals from the minimalist model above against percent African-American. Two things are striking in this plot:
It’s also important to check models for multicollinearity. This is done by calculating the Variance Inflation Factor (VIF) as follows:
mod1 = lm(adj_violent_crimes ~ Population_Density +
Percent_of_children_in_single_female_headed_household +
Percent_of_households_with_no_car +
Percent_Less_than_High_School_Education_ +
Percent_with_a_Graduate_Degree_ +
Percent_Black_Alone +
Median_House_Value +
Median_Rent_ +
Income_Gini_Coefficient_,
data=chicago@data)
library(car)
vif(mod1)
## Population_Density
## 1.234791
## Percent_of_children_in_single_female_headed_household
## 4.052686
## Percent_of_households_with_no_car
## 1.751760
## Percent_Less_than_High_School_Education_
## 3.054951
## Percent_with_a_Graduate_Degree_
## 2.332825
## Percent_Black_Alone
## 3.969937
## Median_House_Value
## 2.518863
## Median_Rent_
## 1.377821
## Income_Gini_Coefficient_
## 2.434391
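Before picking a cutoff, it helps to recall what these numbers mean (the standard definition of VIF, not anything specific to the car package). For predictor \(j\), let \(R_j^2\) be the R-squared from regressing that predictor on all the other predictors; then

\[
VIF_j = \frac{1}{1 - R_j^2}.
\]

So the VIF of about 4 for Percent_of_children_in_single_female_headed_household says the other predictors explain roughly 75% of its variance, and the variance of its coefficient estimate is inflated about fourfold relative to a collinearity-free design.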
Go online and research a good/common ‘cutoff’ for VIF measures and provide your source. Use this ‘cutoff’ to assess the collinearity of the above model.
"
The cutoff is 2.5; we should be concerned when a VIF is greater than 2.5.
references:
http://statisticalhorizons.com/multicollinearity
http://psychologicalstatistics.blogspot.com/2013/11/multicollinearity-and-collinearity-in.html
Also, based on another source, we can use the rule sqrt(vif(fit)) > 2:
http://www.statmethods.net/stats/rdiagnostics.html
" mod1_variables = data.frame(poulationDensity = chicago@data$Population_Density, singleFemaleHeaded = chicago@data$Percent_of_children_in_single_female_headed_household, noCar = chicago@data$Percent_of_households_with_no_car, LessThanHighSchool = chicago@data$Percent_Less_than_High_School_Education_, aGraduateDegree = chicago@data$Percent_with_a_Graduate_Degree_, BlackAlone = chicago@data$Percent_Black_Alone, HouseValue = chicago@data$Median_House_Value, Rent = chicago@data$Median_Rent_, Gini = chicago@data$Income_Gini_Coefficient_)
mod1_variables_corr = cor(mod1_variables, use = "complete.obs", method = "pearson") # get correlations
library('corrplot') #package corrplot
corrplot(mod1_variables_corr, method = "circle", addCoef.col = "black") #plot matrix
#So, based on the correlation matrix, female headed households, less than high school education, and percent black are highly correlated.
It is possible that these three variables are correlated with each other, by dropping two we may be able to improve the model.
mod1_red = lm(adj_violent_crimes ~ Population_Density +
Percent_of_households_with_no_car +
Percent_with_a_Graduate_Degree_ +
Percent_Black_Alone +
Median_House_Value +
Median_Rent_ +
Income_Gini_Coefficient_,
data=chicago@data)
vif(mod1_red)
## Population_Density Percent_of_households_with_no_car
## 1.227744 1.505899
## Percent_with_a_Graduate_Degree_ Percent_Black_Alone
## 2.086370 2.079539
## Median_House_Value Median_Rent_
## 2.489305 1.325998
## Income_Gini_Coefficient_
## 1.680464
Do you notice an improvement in VIFs?
"
Yes, now no VIF value is greater than 2.5
"Perform an ANOVA test of these two nested models. Compare the full and reduced forms of the model.
# it tests whether reduction in the residual sum of squares are statistically significant or not
anova(mod1, mod1_red)
## Analysis of Variance Table
##
## Model 1: adj_violent_crimes ~ Population_Density + Percent_of_children_in_single_female_headed_household +
## Percent_of_households_with_no_car + Percent_Less_than_High_School_Education_ +
## Percent_with_a_Graduate_Degree_ + Percent_Black_Alone + Median_House_Value +
## Median_Rent_ + Income_Gini_Coefficient_
## Model 2: adj_violent_crimes ~ Population_Density + Percent_of_households_with_no_car +
## Percent_with_a_Graduate_Degree_ + Percent_Black_Alone + Median_House_Value +
## Median_Rent_ + Income_Gini_Coefficient_
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 640 2014.8
## 2 642 2018.9 -2 -4.0811 0.6482 0.5233
# So, based on the p-value, the reduction in the residual sum of squares is not statistically significant, and the reduced model is fine
As a final diagnostic we might want to examine the marginal contribution of each variable to the model, its ‘extra sum of squares’. The modelEffectSizes function in the lmSupport library provides partial R-squared values, which it calls pEta-sqr. We see that the percent black alone variable has the highest partial R-squared, but we previously saw that race was correlated with other factors that might be related to violent crime. Should we have dropped race from the model and included education and female-headed households? Understanding that race is an important factor for violent crime prevalence in Chicago is an important insight, but this should be tempered by the fact that race, as we saw, is highly correlated with other factors that are also associated with crime. It is very hard, many would say impossible, to make causal statements using a model like this. Causal statements would be of the nature, “an X% rise in the number of single-parent female-headed households causes a rise of X in the number of crimes.” There is a maxim in statistics, “there is no causation without manipulation”; that is, you can’t establish causal relationships without experimental manipulation. In the social sciences “manipulations” are expensive, ethically complicated, and rarely done. This may not be the case in the physical sciences.
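As a sanity check, the F statistic in the ANOVA table can be recomputed by hand from the RSS and residual degrees of freedom it reports:

\[
F = \frac{(RSS_{red} - RSS_{full})/(df_{red} - df_{full})}{RSS_{full}/df_{full}}
  = \frac{(2018.9 - 2014.8)/2}{2014.8/640} \approx 0.648,
\]

which matches the table; with a p-value of 0.52, dropping the two variables does not significantly worsen the fit.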
We can examine the marginal contribution of each variable to the model, that is, its impact on the model given that all other variables are already in the model:
library(lmSupport) # You'll probably need to install this one
# Think seriously about what we are measuring here (read the package help for more details)
# When talking about these results, break down this test in terms of effect sizes...
modelEffectSizes(mod1_red)
## lm(formula = adj_violent_crimes ~ Population_Density + Percent_of_households_with_no_car +
## Percent_with_a_Graduate_Degree_ + Percent_Black_Alone + Median_House_Value +
## Median_Rent_ + Income_Gini_Coefficient_, data = chicago@data)
##
## Coefficients
## SSR df pEta-sqr dR-sqr
## (Intercept) 14.8060 1 0.0073 NA
## Population_Density 79.0720 1 0.0377 0.0187
## Percent_of_households_with_no_car 72.5148 1 0.0347 0.0172
## Percent_with_a_Graduate_Degree_ 75.0802 1 0.0359 0.0178
## Percent_Black_Alone 525.5997 1 0.2066 0.1244
## Median_House_Value 25.9398 1 0.0127 0.0061
## Median_Rent_ 39.9029 1 0.0194 0.0094
## Income_Gini_Coefficient_ 21.0774 1 0.0103 0.0050
##
## Sum of squared errors (SSE): 2018.9
## Sum of squared total (SST): 4224.6
For this lab you have to make 4 regression models (you’ve already done one!). One model should be for violent crime (the variable used in the demonstration), the other three variables can be any type of crime selected from the Chicago data. For each model:
Overall, do you notice any patterns across your four models? Are certain variables routinely associated with crime? Be careful in your interpretation of model coefficients. Why do you think these factors seem associated with crime? For example, in the model above we say that percent black was the most influential predictor of violent crime frequency. I wonder what the race variable is actually measuring, could it be a proxy for other variables that are highly correlated with the racial composition of neighborhoods and also influence crime? For example, ‘latent’ constructs like structural disadvantage and lack of access to opportunity?
This lab has been adapted from a lab developed by Dr. Seth Spielman at the University of Colorado at Boulder.
In order to determine the neighborhood-level characteristics of crime, four dependent crime variables were modeled using crime data from the city of Chicago and Census-tract-level data from the American Community Survey. To determine how much of the variance in different types of crime could be explained by independent variables, four multiple linear regression models were built and evaluated.
Summarize the data and/or provide a few visualizations (maps, plots). Map the dependent variable (a type of crime)
What general trends do you observe?
In what ways does your final model explain the crime type you chose? Pay special attention to the sign and significance of each variable. Do the predictors align with your intuition about the association between social/economic characteristics and the frequency of crime?
1-1- How do the explanatory variables you chose reflect the crime type you are modeling?
"
The Type of Crime (Dependent Variable): Number_of_violent_crimes
Variables of Interest (Independent Variable):
Population_Density
Percent_with_a_Graduate_Degree_
Percent_Black_Alone
Median_House_Value
Median_Rent_
Income_Gini_Coefficient_
Drug_Abuse
Percent_Employed_in_Service_Occupations
Percent_of_children_in_single_female_headed_household
PCT_of_Households_with_cash_public_assistance
For the number of violent crimes, I started with a set of related variables. For example: higher population density in a neighborhood may lead to a higher crime rate; a more highly educated population may decrease the number of violent crimes; percent black alone can be considered a potential factor for increasing the number of violent crimes; poor neighborhoods (low house value, low rent, low income, fewer employed residents, households who receive public assistance) can be another potential factor; and higher drug abuse may lead to more violent crimes.
# I first ran each of the above variables in a univariate regression against the outcome.
plot(sqrt(Number_of_violent_crimes) ~ (Population_Density), data = chicago@data)
violent_crimes <- lm(sqrt(Number_of_violent_crimes) ~ Population_Density, data = chicago@data)
summary(violent_crimes)
Multiple R-squared: 2.501e-06, Adjusted R-squared: -0.001517
F-statistic: 0.001645 on 1 and 658 DF, p-value: 0.9677
plot(sqrt(Number_of_violent_crimes) ~ (Percent_with_a_Graduate_Degree_), data = chicago@data)
violent_crimes <- lm(sqrt(Number_of_violent_crimes) ~ Percent_with_a_Graduate_Degree_, data = chicago@data)
summary(violent_crimes)
Multiple R-squared: 0.09606, Adjusted R-squared: 0.09468
F-statistic: 69.92 on 1 and 658 DF, p-value: 3.679e-16
plot(sqrt(Number_of_violent_crimes) ~ Percent_Black_Alone, data = chicago@data)
violent_crimes <- lm(sqrt(Number_of_violent_crimes) ~ (Percent_Black_Alone), data = chicago@data)
summary(violent_crimes)
Multiple R-squared: 0.4153, Adjusted R-squared: 0.4144
F-statistic: 467.4 on 1 and 658 DF, p-value: < 2.2e-16
plot(sqrt(Number_of_violent_crimes) ~ (Median_House_Value), data = chicago@data)
violent_crimes <- lm(sqrt(Number_of_violent_crimes) ~ (Median_House_Value), data = chicago@data)
summary(violent_crimes)
Multiple R-squared: 0.1826, Adjusted R-squared: 0.1814
F-statistic: 145.7 on 1 and 652 DF, p-value: < 2.2e-16
plot(sqrt(Number_of_violent_crimes) ~ (Median_Rent_), data = chicago@data)
violent_crimes <- lm(sqrt(Number_of_violent_crimes) ~ (Median_Rent_), data = chicago@data)
summary(violent_crimes)
Multiple R-squared: 0.01443, Adjusted R-squared: 0.01293
F-statistic: 9.579 on 1 and 654 DF, p-value: 0.002053
plot(sqrt(Number_of_violent_crimes) ~ (Income_Gini_Coefficient_), data = chicago@data)
violent_crimes <- lm(sqrt(Number_of_violent_crimes) ~ (Income_Gini_Coefficient_), data = chicago@data)
summary(violent_crimes)
Multiple R-squared: 0.0484, Adjusted R-squared: 0.04696
F-statistic: 33.47 on 1 and 658 DF, p-value: 1.12e-08
plot(sqrt(Number_of_violent_crimes) ~ (Drug_Abuse), data = chicago@data)
violent_crimes <- lm(sqrt(Number_of_violent_crimes) ~ (Drug_Abuse), data = chicago@data)
summary(violent_crimes)
Multiple R-squared: 0.4169, Adjusted R-squared: 0.416
F-statistic: 470.5 on 1 and 658 DF, p-value: < 2.2e-16
plot(sqrt(Number_of_violent_crimes) ~ (Percent_Employed_in_Service_Occupations), data = chicago@data)
violent_crimes <- lm(sqrt(Number_of_violent_crimes) ~ (Percent_Employed_in_Service_Occupations), data = chicago@data)
summary(violent_crimes)
Multiple R-squared: 0.1073, Adjusted R-squared: 0.1059
F-statistic: 79.09 on 1 and 658 DF, p-value: < 2.2e-16
plot(sqrt(Number_of_violent_crimes) ~ (Percent_of_children_in_single_female_headed_household), data = chicago@data)
violent_crimes <- lm(sqrt(Number_of_violent_crimes) ~ (Percent_of_children_in_single_female_headed_household), data = chicago@data)
summary(violent_crimes)
Multiple R-squared: 0.3577, Adjusted R-squared: 0.3568
F-statistic: 366.5 on 1 and 658 DF, p-value: < 2.2e-16
plot(sqrt(Number_of_violent_crimes) ~ (PCT_of_Households_with_cash_public_assistance), data = chicago@data)
violent_crimes <- lm(sqrt(Number_of_violent_crimes) ~ (PCT_of_Households_with_cash_public_assistance ), data = chicago@data)
summary(violent_crimes)
Multiple R-squared: 0.3643, Adjusted R-squared: 0.3633
F-statistic: 377.1 on 1 and 658 DF, p-value: < 2.2e-16
Based on these univariate regressions, all of the variables are individually significant except Population_Density (p = 0.97).
"
2-1- Summarize the data and/or provide a few visualizations (maps, plots). Map the dependent variable (a type of crime)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 13.00 27.00 36.57 52.00 163.00
2-2- What general trends do you observe?
"Based on the map of number of violent crimes, we can see that this crime is not random and is clustered. Box plot shows that there are some outliers in the data."
3-1- Did you combine any fields, create dummy variables, transform variables, subset the data, etc?
# The box plot shows that the data is not normal and there are some outliers. A square root transform is used to normalize the data
# Shapiro-Wilk normality test
shapiro.test(chicago@data$Number_of_violent_crimes) # W = 0.86711, p-value < 2.2e-16
##
## Shapiro-Wilk normality test
##
## data: chicago@data$Number_of_violent_crimes
## W = 0.86711, p-value < 2.2e-16
shapiro.test(sqrt(chicago@data$Number_of_violent_crimes)) # W = 0.98138, p-value = 1.934e-07 (better)
##
## Shapiro-Wilk normality test
##
## data: sqrt(chicago@data$Number_of_violent_crimes)
## W = 0.98138, p-value = 1.934e-07
#Create a new variable called adj_violent_crimes that ‘adjusts’ for potential outliers.
chicago$adj_violent_crimes = sqrt(chicago$Number_of_violent_crimes)
# if we map the transformed data, we can see a random spatial patterns.
4-1- LM summary
summary(mod1)
##
## Call:
## lm(formula = adj_violent_crimes ~ Population_Density + Percent_with_a_Graduate_Degree_ +
## Percent_Black_Alone + Median_House_Value + Median_Rent_ +
## Drug_Abuse + Percent_Employed_in_Service_Occupations + Percent_of_children_in_single_female_headed_household +
## PCT_of_Households_with_cash_public_assistance + Income_Gini_Coefficient_,
## data = chicago@data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.6179 -1.0512 -0.0608 1.0778 7.2197
##
## Coefficients:
## Estimate
## (Intercept) 2.389e+00
## Population_Density 4.412e-05
## Percent_with_a_Graduate_Degree_ -2.676e-03
## Percent_Black_Alone 1.961e-02
## Median_House_Value -2.822e-06
## Median_Rent_ 8.110e-04
## Drug_Abuse 1.289e-02
## Percent_Employed_in_Service_Occupations -3.538e-03
## Percent_of_children_in_single_female_headed_household 3.741e-03
## PCT_of_Households_with_cash_public_assistance 1.862e-02
## Income_Gini_Coefficient_ 1.669e+00
## Std. Error t value
## (Intercept) 6.339e-01 3.769
## Population_Density 8.048e-06 5.482
## Percent_with_a_Graduate_Degree_ 1.058e-02 -0.253
## Percent_Black_Alone 3.042e-03 6.445
## Median_House_Value 6.502e-07 -4.341
## Median_Rent_ 3.314e-04 2.447
## Drug_Abuse 1.021e-03 12.629
## Percent_Employed_in_Service_Occupations 8.819e-03 -0.401
## Percent_of_children_in_single_female_headed_household 4.517e-03 0.828
## PCT_of_Households_with_cash_public_assistance 7.668e-03 2.428
## Income_Gini_Coefficient_ 1.187e+00 1.406
## Pr(>|t|)
## (Intercept) 0.000179 ***
## Population_Density 6.07e-08 ***
## Percent_with_a_Graduate_Degree_ 0.800426
## Percent_Black_Alone 2.28e-10 ***
## Median_House_Value 1.65e-05 ***
## Median_Rent_ 0.014676 *
## Drug_Abuse < 2e-16 ***
## Percent_Employed_in_Service_Occupations 0.688419
## Percent_of_children_in_single_female_headed_household 0.407807
## PCT_of_Households_with_cash_public_assistance 0.015463 *
## Income_Gini_Coefficient_ 0.160119
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.588 on 639 degrees of freedom
## (10 observations deleted due to missingness)
## Multiple R-squared: 0.6185, Adjusted R-squared: 0.6125
## F-statistic: 103.6 on 10 and 639 DF, p-value: < 2.2e-16
4-2- Interpret model coefficients/p-values, model rˆ2/F-Test
"Most of the variables are significant, except, Percent_with_a_Graduate_Degree_, Percent_Employed_in_Service_occupations, Percent_of_children_in_single_female_headed_household, and Income_Gini_Coefficient_.
The R-squared is high, which means our independent variables explain most of the variability of adj_violent_crimes in the regression model.
The p-value of our model is very small, which means there is a significant relationship between the variables in the regression model.
The Median_Rent_ and Income_Gini_Coefficient_ coefficients seem suspicious, because an increase in either leads to an increase in violent crime
"
4-3- Model Diagnostics
4-3-1- Residual plots
plot(mod1)
hist(resid(mod1))
plot(resid(mod1) ~ predict(mod1))
abline(0, 0) # the horizon
4-3-1-1- Heteroskedasticity (non-constant variance)
# Evaluate heteroscedasticity
# If the test is positive (low p value), you should see if any transformation of the dependent variable helps you eliminate heteroscedasticity
library(zoo)
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
library(lmtest)
bptest(mod1)
##
## studentized Breusch-Pagan test
##
## data: mod1
## BP = 48.395, df = 10, p-value = 5.256e-07
" There is heteroscedasticity
"
4-3-1-2- Influential observations (possible outliers)
# Bonferroni p-values to assesse Outliers
library(car)
outlierTest(mod1) # Bonferonni p-value for most extreme obs
## rstudent unadjusted p-value Bonferonni p
## 19 4.673857 3.6082e-06 0.0023453
## 312 -4.264718 2.3046e-05 0.0149800
" There are 2 outliers based on Bonferroni test (19 and 312). But 237, 20, and 551 also are apparantly outliers based on the regression plots.
"
4-3-1-3- pattern (Nonlinearity)
# When the residuals bounce randomly around the 0 line, it suggests that the assumption that the relationship is linear is reasonable
# Evaluate Nonlinearity
# component + residual plot
crPlots(mod1)
" Drug-Abuse and median house value are not linear.
"
4-3-1-4- Normality of Residuals
qqPlot(mod1, main="QQ Plot") #qq plot for studentized resid
# distribution of studentized residuals
library(MASS)
sresid <- studres(mod1)
hist(sresid, freq=FALSE,
main="Distribution of Studentized Residuals")
xfit<-seq(min(sresid),max(sresid),length=40)
yfit<-dnorm(xfit)
lines(xfit, yfit)
#Null hypothesis residuals are normally distributed
shapiro.test(resid(mod1))
##
## Shapiro-Wilk normality test
##
## data: resid(mod1)
## W = 0.99112, p-value = 0.0006023
" Residuals are nearly normal
"
4-3-1-5- Non-independence of Errors
# Test for Autocorrelated Errors (verifies if the residuals from a linear model are correlated or not)
library(car)
durbinWatsonTest(mod1)
## lag Autocorrelation D-W Statistic p-value
## 1 0.2759873 1.447699 0
## Alternative hypothesis: rho != 0
# The null hypothesis (H0) is that there is no correlation among the residuals, i.e., they are independent.
# The alternative hypothesis (Ha) is that the residuals are autocorrelated.
" The reported p-value is ~0, so the residuals are autocorrelated
"
4-3-2- map residuals
" There is not a clear spatial pattern in the residual map.
"
4-3-3- check for multicollinearity (VIF) (Note: the cutoff is 2.5)
# Evaluate Collinearity
library(car)
vif(mod1) # variance inflation factors
## Population_Density
## 1.172813
## Percent_with_a_Graduate_Degree_
## 3.017687
## Percent_Black_Alone
## 4.337218
## Median_House_Value
## 2.509130
## Median_Rent_
## 1.384075
## Drug_Abuse
## 1.473562
## Percent_Employed_in_Service_Occupations
## 1.674395
## Percent_of_children_in_single_female_headed_household
## 3.811204
## PCT_of_Households_with_cash_public_assistance
## 3.392947
## Income_Gini_Coefficient_
## 1.790531
"
The cutoff is 2.5. Five of the VIFs exceed it (Median_House_Value only marginally, at 2.51), so three of the collinear variables should be eliminated.
"
4-3-4- examine the partial R-Square for each variable
library(lmSupport)
modelEffectSizes(mod1)
## lm(formula = adj_violent_crimes ~ Population_Density + Percent_with_a_Graduate_Degree_ +
## Percent_Black_Alone + Median_House_Value + Median_Rent_ +
## Drug_Abuse + Percent_Employed_in_Service_Occupations + Percent_of_children_in_single_female_headed_household +
## PCT_of_Households_with_cash_public_assistance + Income_Gini_Coefficient_,
## data = chicago@data)
##
## Coefficients
## SSR df pEta-sqr
## (Intercept) 35.8347 1 0.0217
## Population_Density 75.7935 1 0.0449
## Percent_with_a_Graduate_Degree_ 0.1613 1 0.0001
## Percent_Black_Alone 104.7788 1 0.0610
## Median_House_Value 47.5297 1 0.0286
## Median_Rent_ 15.1033 1 0.0093
## Drug_Abuse 402.2937 1 0.1997
## Percent_Employed_in_Service_Occupations 0.4060 1 0.0003
## Percent_of_children_in_single_female_headed_household 1.7306 1 0.0011
## PCT_of_Households_with_cash_public_assistance 14.8687 1 0.0091
## Income_Gini_Coefficient_ 4.9887 1 0.0031
## dR-sqr
## (Intercept) NA
## Population_Density 0.0179
## Percent_with_a_Graduate_Degree_ 0.0000
## Percent_Black_Alone 0.0248
## Median_House_Value 0.0113
## Median_Rent_ 0.0036
## Drug_Abuse 0.0952
## Percent_Employed_in_Service_Occupations 0.0001
## Percent_of_children_in_single_female_headed_household 0.0004
## PCT_of_Households_with_cash_public_assistance 0.0035
## Income_Gini_Coefficient_ 0.0012
##
## Sum of squared errors (SSE): 1611.9
## Sum of squared total (SST): 4224.6
"
From the table generated by this function, pEta-sqr is highest for Drug_Abuse, at 0.1997,
and lowest for Percent_with_a_Graduate_Degree_, at 0.0001. These pEta-sqr values reflect the marginal contribution of each variable: its impact on the model given that all other variables
are already included.
"
5-1- What variables did you add/remove?
"A VIF test was conducted on the model to check variable multicolinearity. Results indicated that some variables may have correlations with one another. To compensate, 3 of the them, Percent_Black_Alone ,
Percent_of_children_in_single_female_headed_household and Percent_with_a_Graduate_Degree_ , were omitted."
5-2- check for multicollinearity (VIF)
The VIFs for the reduced model are provided below. Removing these variables brought all remaining VIFs under the 2.5 cutoff:
## Population_Density
## 1.082460
## Median_House_Value
## 1.875354
## Median_Rent_
## 1.286985
## Drug_Abuse
## 1.399832
## Percent_Employed_in_Service_Occupations
## 1.488740
## PCT_of_Households_with_cash_public_assistance
## 2.309115
## Income_Gini_Coefficient_
## 1.408851
5-3- Diagnostic tests: AIC, Anova, extra sum of squares (modelEffectSizes, partial R-Squared), etc.
# Akaike Information Criterion (AIC)
# Small values are desirable
AIC(mod1) # 2458.928
## [1] 2458.928
AIC(mod1_red1) # 2530.21
## [1] 2530.21
# AIC has increased
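For reference, the AIC values printed above can be reproduced from the Gaussian log-likelihood of an lm fit; a sketch on simulated data (this is the standard stats definition, not anything specific to this lab):

```r
set.seed(11)
n   <- 200
x   <- rnorm(n)
y   <- 1 + x + rnorm(n)
fit <- lm(y ~ x)

rss <- sum(resid(fit)^2)
k   <- length(coef(fit)) + 1   # parameters counted: coefficients plus sigma^2
ll  <- -n / 2 * (log(2 * pi) + log(rss / n) + 1)  # Gaussian log-likelihood at the MLE
aic_manual <- -2 * ll + 2 * k
```

Because the penalty term 2k grows with the number of parameters, dropping useful predictors can raise AIC even though the model gets simpler, which is what happened here.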
# ANOVA: tests whether the reduction in the residual sum of squares is statistically significant
anova(mod1, mod1_red1)
## Analysis of Variance Table
##
## Model 1: adj_violent_crimes ~ Population_Density + Percent_with_a_Graduate_Degree_ +
## Percent_Black_Alone + Median_House_Value + Median_Rent_ +
## Drug_Abuse + Percent_Employed_in_Service_Occupations + Percent_of_children_in_single_female_headed_household +
## PCT_of_Households_with_cash_public_assistance + Income_Gini_Coefficient_
## Model 2: adj_violent_crimes ~ Population_Density + Median_House_Value +
## Median_Rent_ + Drug_Abuse + Percent_Employed_in_Service_Occupations +
## PCT_of_Households_with_cash_public_assistance + Income_Gini_Coefficient_
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 639 1611.8
## 2 642 1815.3 -3 -203.5 26.892 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#Full model explains significantly more variance
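A sketch of the F statistic behind this comparison, on simulated data: the per-degree-of-freedom drop in RSS from the reduced to the full model, divided by the full model's mean squared error.

```r
set.seed(2)
n  <- 250
x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n)
y  <- 1 + x1 + 0.5 * x2 + rnorm(n)   # x3 is pure noise

full <- lm(y ~ x1 + x2 + x3)
red  <- lm(y ~ x1)                   # drops x2 and x3, so q = 2

rss_f <- sum(resid(full)^2)
rss_r <- sum(resid(red)^2)
q     <- 2
f_manual <- ((rss_r - rss_f) / q) / (rss_f / full$df.residual)
f_anova  <- anova(red, full)$F[2]    # the F that anova() prints
```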
# extra sum of squares (modelEffectSizes, partial R-Squared)
modelEffectSizes(mod1_red1)
## lm(formula = adj_violent_crimes ~ Population_Density + Median_House_Value +
## Median_Rent_ + Drug_Abuse + Percent_Employed_in_Service_Occupations +
## PCT_of_Households_with_cash_public_assistance + Income_Gini_Coefficient_,
## data = chicago@data)
##
## Coefficients
## SSR df pEta-sqr dR-sqr
## (Intercept) 17.5435 1 0.0096 NA
## Population_Density 27.0390 1 0.0147 0.0064
## Median_House_Value 161.8894 1 0.0819 0.0383
## Median_Rent_ 44.2973 1 0.0238 0.0105
## Drug_Abuse 540.0902 1 0.2293 0.1278
## Percent_Employed_in_Service_Occupations 0.0646 1 0.0000 0.0000
## PCT_of_Households_with_cash_public_assistance 140.9315 1 0.0720 0.0334
## Income_Gini_Coefficient_ 52.4329 1 0.0281 0.0124
##
## Sum of squared errors (SSE): 1815.4
## Sum of squared total (SST): 4224.6
# Drug_Abuse had the highest partial eta-squared value, at 0.2293, showing it contributed most to the violent crime model. Percent_Employed_in_Service_Occupations, on the other hand, has the smallest partial sum of squares.
6-1- Follow the same workflow as the “1st Model Specification”/“Model adjustment” sections. You should aim to run through ˜2 more model iterations. Keep in mind that your final model does not need to be perfect, you will be explaining its shortcomings in the final model summary.
#I ran the model several times and here is the final model:
mod1_final = lm(adj_violent_crimes ~
Median_House_Value +
Percent_of_children_in_single_female_headed_household +
sqrt(Drug_Abuse) ,
data=chicago@data)
summary(mod1_final)
##
## Call:
## lm(formula = adj_violent_crimes ~ Median_House_Value + Percent_of_children_in_single_female_headed_household +
## sqrt(Drug_Abuse), data = chicago@data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.7615 -0.9075 -0.0031 0.9166 5.7257
##
## Coefficients:
## Estimate
## (Intercept) 2.933e+00
## Median_House_Value -1.497e-06
## Percent_of_children_in_single_female_headed_household 1.622e-02
## sqrt(Drug_Abuse) 4.290e-01
## Std. Error t value
## (Intercept) 2.206e-01 13.296
## Median_House_Value 4.214e-07 -3.553
## Percent_of_children_in_single_female_headed_household 2.743e-03 5.913
## sqrt(Drug_Abuse) 1.866e-02 22.987
## Pr(>|t|)
## (Intercept) < 2e-16 ***
## Median_House_Value 0.000409 ***
## Percent_of_children_in_single_female_headed_household 5.44e-09 ***
## sqrt(Drug_Abuse) < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.454 on 650 degrees of freedom
## (6 observations deleted due to missingness)
## Multiple R-squared: 0.6768, Adjusted R-squared: 0.6753
## F-statistic: 453.7 on 3 and 650 DF, p-value: < 2.2e-16
# Although Drug_Abuse does not appear to have a linear relationship with violent crime, the partial sum of squares shows that it contributes the most to the model.
# Percent_Black_Alone is significant, but the plot against violent crimes shows two distinct groups that we cannot model here.
# We transformed Drug_Abuse using a square root, which significantly improved the model.
# This model has the smallest AIC.
# However, the model is still heteroscedastic and its residuals are autocorrelated.
7-1- In what ways does your final model explain the crime type you chose? Pay special attention to the sign and significance of each variable. Do the predictors align with your intuition about the association between social/economic characteristics and the frequency of crime?
" To explain our crime type (number of violent crime, we came up with three variable that can explain the variation of out moeld better. 1) Median House Value: this variable is not very significant, but has a negative coefficient, and so its increase leads to decreasing in the number of violent crime, that is based on my prediction in the begining. 2) Percent of children in single female headed household is the second significant variable in our model and with a positive coefficient has a direct relationship with number of violent crime. This predictor also align with my intuition about the association between social/economic characteristics and the number of violent crime. 3) Drug Abuse is the most significant variabel of this model. It has a positive coefficient that makes sense.
"
7-2- What shortcomings does your final model have? In what ways is the model’s explanatory power limited? Does the model still violate assumptions of OLS regression (i.e. heteroskedasticity, residual [spatial]autocorrelation)? (Explain how these problems limit your ability to draw inferences about the crime type you chose.) How might the characteristics of the variables you chose influence these problems? [*]
"
Our final model has an R-squared of 0.6768, which is reasonable for a linear regression model, but the results are still not completely satisfying. First of all, the relationship between the number of violent crimes and drug abuse, the most significant variable in our model, does not appear to be linear; forcing a purely linear specification contributes to the heteroscedasticity and autocorrelated residuals in our model. Furthermore, when suspect variables were removed and the model was re-tested, the VIFs improved in most cases, but the variance explained by the model (as measured by ANOVA) decreased; i.e., the full model explains significantly more variance.
"
1-1- How do the explanatory variables you chose reflect the crime type you are modeling?
"
The Type of Crime (Dependent Variable): Vehicle_Theft
Variables of Interest (Independent Variable):
Percent_with_a_Graduate_Degree_
Percent_Black_Alone
Median_House_Value
Number_of_property_crimes
Weapons_Violation
Percent_Hispanic_
Burglary
Liquor_License
Vandalism
Percent_of_families_that_consist_of_married_couples
Drug_Abuse
For the number of vehicle thefts, I started with a set of plausibly related variables. For example, it seems that higher graduate-degree attainment and more married couples could decrease vehicle theft, while higher Percent_Black_Alone, property crimes, weapons violations, burglary, drug abuse, and vandalism could increase it. I am not sure about Median_House_Value, Percent_Hispanic_, and Liquor_License, but I would like to investigate them in the data.
# I first ran each of the variables above in a univariate regression against the dependent variable.
plot(sqrt(Vehicle_Theft) ~ (Percent_with_a_Graduate_Degree_ ), data = chicago@data)
violent_crimes <- lm(sqrt(Vehicle_Theft) ~ Percent_with_a_Graduate_Degree_ , data = chicago@data)
summary(violent_crimes)
Multiple R-squared: 0.04863, Adjusted R-squared: 0.04719
F-statistic: 33.64 on 1 and 658 DF, p-value: 1.031e-08
plot(sqrt(Vehicle_Theft) ~ (Percent_Black_Alone), data = chicago@data)
violent_crimes <- lm(sqrt(Vehicle_Theft) ~ Percent_Black_Alone, data = chicago@data)
summary(violent_crimes)
Multiple R-squared: 0.01896, Adjusted R-squared: 0.01747
F-statistic: 12.72 on 1 and 658 DF, p-value: 0.000389
plot(sqrt(chicago@data$Vehicle_Theft) , chicago@data$Median_House_Value)
violent_crimes <- lm(sqrt(Vehicle_Theft) ~ (Median_House_Value), data = chicago@data)
summary(violent_crimes)
Multiple R-squared: 0.01792, Adjusted R-squared: 0.01642
F-statistic: 11.9 on 1 and 652 DF, p-value: 0.0005975
plot(sqrt(Vehicle_Theft) ~ (Number_of_property_crimes), data = chicago@data)
violent_crimes <- lm(sqrt(Vehicle_Theft) ~ (Number_of_property_crimes), data = chicago@data)
summary(violent_crimes)
Multiple R-squared: 0.2424, Adjusted R-squared: 0.2412
F-statistic: 210.5 on 1 and 658 DF, p-value: < 2.2e-16
plot(sqrt(Vehicle_Theft) ~ (Weapons_Violation), data = chicago@data)
violent_crimes <- lm(sqrt(Vehicle_Theft) ~ (Weapons_Violation), data = chicago@data)
summary(violent_crimes)
Multiple R-squared: 0.1787, Adjusted R-squared: 0.1775
F-statistic: 143.2 on 1 and 658 DF, p-value: < 2.2e-16
plot(sqrt(Vehicle_Theft) ~ (Percent_Hispanic_), data = chicago@data)
violent_crimes <- lm(sqrt(Vehicle_Theft) ~ (Percent_Hispanic_), data = chicago@data)
summary(violent_crimes)
Multiple R-squared: 0.01386, Adjusted R-squared: 0.01236
F-statistic: 9.248 on 1 and 658 DF, p-value: 0.002452
plot(sqrt(Vehicle_Theft) ~ (Burglary), data = chicago@data)
violent_crimes <- lm(sqrt(Vehicle_Theft) ~ (Burglary), data = chicago@data)
summary(violent_crimes)
Multiple R-squared: 0.2952, Adjusted R-squared: 0.2942
F-statistic: 275.7 on 1 and 658 DF, p-value: < 2.2e-16
plot(sqrt(Vehicle_Theft) ~ (Liquor_License), data = chicago@data)
violent_crimes <- lm(sqrt(Vehicle_Theft) ~ (Liquor_License), data = chicago@data)
summary(violent_crimes)
Multiple R-squared: 0.0478, Adjusted R-squared: 0.04635
F-statistic: 33.03 on 1 and 658 DF, p-value: 1.389e-08
plot(sqrt(Vehicle_Theft) ~ (Vandalism), data = chicago@data)
violent_crimes <- lm(sqrt(Vehicle_Theft) ~ (Vandalism), data = chicago@data)
summary(violent_crimes)
Multiple R-squared: 0.4178, Adjusted R-squared: 0.417
F-statistic: 472.3 on 1 and 658 DF, p-value: < 2.2e-16
plot(sqrt(Vehicle_Theft) ~ (Percent_of_families_that_consist_of_married_couples), data = chicago@data)
violent_crimes <- lm(sqrt(Vehicle_Theft) ~ (Percent_of_families_that_consist_of_married_couples ), data = chicago@data)
summary(violent_crimes)
Multiple R-squared: 0.02854, Adjusted R-squared: 0.02707
F-statistic: 19.33 on 1 and 658 DF, p-value: 1.28e-05
plot(sqrt(Vehicle_Theft) ~ (Drug_Abuse), data = chicago@data)
violent_crimes <- lm(sqrt(Vehicle_Theft) ~ (Drug_Abuse), data = chicago@data)
summary(violent_crimes)
Multiple R-squared: 0.2267, Adjusted R-squared: 0.2256
F-statistic: 192.9 on 1 and 658 DF, p-value: < 2.2e-16
All of the variables are significant in these initial univariate regressions.
"
2-1- Summarize the data and/or provide a few visualizations (maps, plots). Map the dependent variable (a type of crime)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 10.00 18.00 20.21 27.00 101.00
2-2- What general trends do you observe?
"Based on the map of number of Vehicle Theft, we can see that this crime is not random and is clustered. Box plot shows that there are some outliers in the data."
3-1- Did you combine any fields, create dummy variables, transform variables, subset the data, etc?
# The box plot shows that the data are not normal and contain some outliers. A square-root transform is used to normalize the data.
# Shapiro-Wilk normality test
shapiro.test(chicago@data$Vehicle_Theft) # W = 0.92161, p-value < 2.2e-16
##
## Shapiro-Wilk normality test
##
## data: chicago@data$Vehicle_Theft
## W = 0.92161, p-value < 2.2e-16
shapiro.test(sqrt(chicago@data$Vehicle_Theft)) # W = 0.99351, p-value = 0.005998 (better)
##
## Shapiro-Wilk normality test
##
## data: sqrt(chicago@data$Vehicle_Theft)
## W = 0.99351, p-value = 0.005998
#Create a new variable called adj_Vehicle_Theft that ‘adjusts’ for potential outliers.
chicago$adj_Vehicle_Theft = sqrt(chicago$Vehicle_Theft)
# If we map the transformed data, the spatial pattern looks more random.
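The effect of the square-root transform on a skewed count variable can be sketched on simulated data (negative binomial counts chosen to resemble a right-skewed crime count; this is not the Chicago data). The Shapiro-Wilk W statistic is closer to 1 the closer the sample is to normal.

```r
set.seed(9)
counts <- rnbinom(600, mu = 20, size = 2)       # right-skewed count variable
w_raw  <- shapiro.test(counts)$statistic        # W for the raw counts
w_sqrt <- shapiro.test(sqrt(counts))$statistic  # W after the square-root transform
```

As with Vehicle_Theft above, the transform typically raises W even when the transformed data still fail a strict normality test.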
4-1- LM summary
summary(mod1)
##
## Call:
## lm(formula = adj_Vehicle_Theft ~ Population_Density + Percent_with_a_Graduate_Degree_ +
## Percent_Black_Alone + Median_House_Value + Number_of_property_crimes +
## Weapons_Violation + Percent_Hispanic_ + Burglary + Liquor_License +
## Vandalism + Percent_of_families_that_consist_of_married_couples +
## Drug_Abuse, data = chicago@data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.2653 -0.7004 0.0331 0.7535 4.1643
##
## Coefficients:
## Estimate Std. Error
## (Intercept) 1.398e+00 3.498e-01
## Population_Density 7.411e-06 6.183e-06
## Percent_with_a_Graduate_Degree_ -1.383e-02 6.768e-03
## Percent_Black_Alone 5.536e-03 2.453e-03
## Median_House_Value 1.816e-06 4.507e-07
## Number_of_property_crimes 2.928e-03 4.995e-04
## Weapons_Violation -1.397e-02 1.264e-02
## Percent_Hispanic_ 1.632e-02 2.683e-03
## Burglary 6.918e-03 3.508e-03
## Liquor_License 9.829e-03 4.067e-02
## Vandalism 2.555e-02 3.322e-03
## Percent_of_families_that_consist_of_married_couples -5.446e-03 4.183e-03
## Drug_Abuse 1.487e-03 8.171e-04
## t value Pr(>|t|)
## (Intercept) 3.996 7.18e-05 ***
## Population_Density 1.199 0.2311
## Percent_with_a_Graduate_Degree_ -2.044 0.0414 *
## Percent_Black_Alone 2.257 0.0243 *
## Median_House_Value 4.029 6.28e-05 ***
## Number_of_property_crimes 5.861 7.36e-09 ***
## Weapons_Violation -1.105 0.2698
## Percent_Hispanic_ 6.081 2.05e-09 ***
## Burglary 1.972 0.0490 *
## Liquor_License 0.242 0.8091
## Vandalism 7.691 5.52e-14 ***
## Percent_of_families_that_consist_of_married_couples -1.302 0.1934
## Drug_Abuse 1.820 0.0692 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.088 on 641 degrees of freedom
## (6 observations deleted due to missingness)
## Multiple R-squared: 0.5244, Adjusted R-squared: 0.5155
## F-statistic: 58.91 on 12 and 641 DF, p-value: < 2.2e-16
4-2- Interpret model coefficients/p-values, model rˆ2/F-Test
"Most of the variables are significant, except, Population_Density, Weapons_Violation, Liquor_License, Percent_of_families_that_consist_of_married_couples, Drug_Abuse.
The R-square is good (0.5244) that means our dependent variales explains half of the variability of the adj_Vehicle_Theft in the the regression model.
The p-value of our model is very small and it means that there is a significant relationship between the variables in the regression model.
Weapons_Violation coefficients seems sucpitious, because their increase lead to decrease in the violent crime
"
4-3- Model Diagnostics
4-3-1- Residual plots
plot(mod1)
hist(resid(mod1))
plot(resid(mod1) ~ predict(mod1))
abline(0, 0) # the horizon
4-3-1-1- Heteroskedasticity (non-constant variance)
# Evaluate heteroscedasticity
# If the test is positive (low p value), you should see if any transformation of the dependent variable helps you eliminate heteroscedasticity
library(zoo)
library(lmtest)
bptest(mod1)
##
## studentized Breusch-Pagan test
##
## data: mod1
## BP = 165.06, df = 12, p-value < 2.2e-16
" There is heteroscedasticity
"
4-3-1-2- Influential observations (possible outliers)
# Bonferroni p-values to assess outliers
library(car)
outlierTest(mod1) # Bonferonni p-value for most extreme obs
## rstudent unadjusted p-value Bonferonni p
## 19 -6.546750 1.2070e-10 7.8939e-08
## 520 -4.343134 1.6329e-05 1.0679e-02
" There are 2 outliers based on Bonferroni test (19 and 520). But 481 also is apparantly outliers based on the regression plots.
"
4-3-1-3- pattern (Nonlinearity)
# When the residuals bounce randomly around the 0 line, it suggests that the assumption that the relationship is linear is reasonable
# Evaluate Nonlinearity
# component + residual plot
crPlots(mod1)
## Warning in smoother(.x, partial.res[, var], col = col.lines[2], log.x =
## FALSE, : could not fit smooth
" Number_of_property_crimes and median house value, and Vandalism are not linear.
"
4-3-1-4- Normality of Residuals
qqPlot(mod1, main="QQ Plot") #qq plot for studentized resid
# distribution of studentized residuals
library(MASS)
sresid <- studres(mod1)
hist(sresid, freq=FALSE,
main="Distribution of Studentized Residuals")
xfit<-seq(min(sresid),max(sresid),length=40)
yfit<-dnorm(xfit)
lines(xfit, yfit)
#Null hypothesis residuals are normally distributed
shapiro.test(resid(mod1))
##
## Shapiro-Wilk normality test
##
## data: resid(mod1)
## W = 0.99003, p-value = 0.0002048
" Residuals are nearly normal. It seems there are some outliers.
"
4-3-1-5- Non-independence of Errors
# Test for Autocorrelated Errors (verifies if the residuals from a linear model are correlated or not)
library(car)
durbinWatsonTest(mod1)
## lag Autocorrelation D-W Statistic p-value
## 1 0.2561444 1.483877 0
## Alternative hypothesis: rho != 0
# The null hypothesis (H0) is that there is no correlation among the residuals, i.e., they are independent.
# The alternative hypothesis (Ha) is that the residuals are autocorrelated.
" The reported p-value is ~0, so the residuals are autocorrelated
"
4-3-2- map residuals
" There is not a clear spatial pattern in the residual map.
"
4-3-3- check for multicollinearity (VIF) (Note: the cutoff is 2.5)
# Evaluate Collinearity
library(car)
vif(mod1) # variance inflation factors
## Population_Density
## 1.486647
## Percent_with_a_Graduate_Degree_
## 2.649078
## Percent_Black_Alone
## 6.037868
## Median_House_Value
## 2.578282
## Number_of_property_crimes
## 1.873371
## Weapons_Violation
## 3.530281
## Percent_Hispanic_
## 3.893392
## Burglary
## 2.957203
## Liquor_License
## 1.237177
## Vandalism
## 4.694101
## Percent_of_families_that_consist_of_married_couples
## 2.456693
## Drug_Abuse
## 2.016329
"
The cutoff is 2.5. Seven of the VIFs exceed it, and Percent_of_families_that_consist_of_married_couples is borderline (2.46); the flagged variables are:
Percent_Black_Alone
Percent_Hispanic_
Percent_of_families_that_consist_of_married_couples
Percent_with_a_Graduate_Degree_
Median_House_Value
Weapons_Violation
Burglary
Vandalism
"
4-3-4- examine the partial R-Square for each variable
library(lmSupport)
modelEffectSizes(mod1)
## lm(formula = adj_Vehicle_Theft ~ Population_Density + Percent_with_a_Graduate_Degree_ +
## Percent_Black_Alone + Median_House_Value + Number_of_property_crimes +
## Weapons_Violation + Percent_Hispanic_ + Burglary + Liquor_License +
## Vandalism + Percent_of_families_that_consist_of_married_couples +
## Drug_Abuse, data = chicago@data)
##
## Coefficients
## SSR df pEta-sqr
## (Intercept) 18.8886 1 0.0243
## Population_Density 1.6994 1 0.0022
## Percent_with_a_Graduate_Degree_ 4.9417 1 0.0065
## Percent_Black_Alone 6.0269 1 0.0079
## Median_House_Value 19.1979 1 0.0247
## Number_of_property_crimes 40.6299 1 0.0509
## Weapons_Violation 1.4430 1 0.0019
## Percent_Hispanic_ 43.7412 1 0.0545
## Burglary 4.6007 1 0.0060
## Liquor_License 0.0691 1 0.0001
## Vandalism 69.9639 1 0.0845
## Percent_of_families_that_consist_of_married_couples 2.0049 1 0.0026
## Drug_Abuse 3.9173 1 0.0051
## dR-sqr
## (Intercept) NA
## Population_Density 0.0011
## Percent_with_a_Graduate_Degree_ 0.0031
## Percent_Black_Alone 0.0038
## Median_House_Value 0.0120
## Number_of_property_crimes 0.0255
## Weapons_Violation 0.0009
## Percent_Hispanic_ 0.0274
## Burglary 0.0029
## Liquor_License 0.0000
## Vandalism 0.0439
## Percent_of_families_that_consist_of_married_couples 0.0013
## Drug_Abuse 0.0025
##
## Sum of squared errors (SSE): 758.2
## Sum of squared total (SST): 1594.2
"
From the table generated by this function, pEta-sqr is highest for Vandalism, at 0.0845,
and lowest for Liquor_License, at 0.0001. These pEta-sqr values reflect the marginal contribution of each variable: its impact on the model given that all other variables
are already included.
"
5-1- What variables did you add/remove?
"A VIF test was conducted on the model to check variable multicolinearity. Results indicated that 8 variables may have correlations with one another. To compensate, 7 of the them were omitted:,
Percent_Black_Alone
Percent_Hispanic_
Percent_of_families_that_consist_of_married_couples
Percent_with_a_Graduate_Degree_
Median_House_Value
Weapons_Violation
Burglary"
5-2- check for multicollinearity (VIF)
The VIFs for the reduced model are provided below. Removing these variables brought all remaining VIFs under the 2.5 cutoff:
## Population_Density Number_of_property_crimes
## 1.069952 1.554681
## Liquor_License Vandalism
## 1.117795 2.034895
## Drug_Abuse
## 1.469310
5-3- Diagnostic tests: AIC, Anova, extra sum of squares (modelEffectSizes, partial R-Squared), etc.
# Akaike Information Criterion (AIC)
# Small values are desirable
AIC(mod1) # 1980.626
## [1] 1980.626
AIC(mod1_red1) # 2058.176
## [1] 2058.176
# AIC has increased
# ANOVA: tests whether the reduction in the residual sum of squares is statistically significant
#anova(mod1, mod1_red1)
# I suspect anova(mod1, mod1_red1) fails here because the two models were fit on different rows: lm() drops observations with NAs in any included variable, and the variables removed from the reduced model contain missing values, so the estimation samples differ.
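A likely culprit, sketched on simulated data (not the Chicago data): anova() for nested models assumes both fits use the same observations, and NAs confined to a dropped variable change the estimation sample. Refitting both models on common complete cases restores a valid comparison.

```r
set.seed(4)
d   <- data.frame(x1 = rnorm(100), x2 = rnorm(100))
d$y <- 1 + d$x1 + 0.5 * d$x2 + rnorm(100)
d$x2[1:5] <- NA                        # NAs only in the variable unique to the full model

full_all <- lm(y ~ x1 + x2, data = d)  # lm() silently drops the 5 incomplete rows
red_all  <- lm(y ~ x1, data = d)       # fit on all 100 rows: a different sample

d_cc <- na.omit(d)                     # restrict both fits to common complete cases
full <- lm(y ~ x1 + x2, data = d_cc)
red  <- lm(y ~ x1, data = d_cc)
tab  <- anova(red, full)               # now a valid nested comparison
```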
# extra sum of squares (modelEffectSizes, partial R-Squared)
modelEffectSizes(mod1_red1)
## lm(formula = adj_Vehicle_Theft ~ Population_Density + Number_of_property_crimes +
## Liquor_License + Vandalism + Drug_Abuse, data = chicago@data)
##
## Coefficients
## SSR df pEta-sqr dR-sqr
## (Intercept) 389.3589 1 0.3128 NA
## Population_Density 36.2908 1 0.0407 0.0225
## Number_of_property_crimes 36.3931 1 0.0408 0.0225
## Liquor_License 3.8459 1 0.0045 0.0024
## Vandalism 210.2257 1 0.1973 0.1302
## Drug_Abuse 1.3752 1 0.0016 0.0009
##
## Sum of squared errors (SSE): 855.4
## Sum of squared total (SST): 1614.8
# Vandalism had the highest partial eta-squared value, at 0.1973, showing it contributed most to the vehicle theft model. Drug_Abuse, on the other hand, has the smallest partial sum of squares.
6-1- Follow the same workflow as the “1st Model Specification”/“Model adjustment” sections. You should aim to run through ˜2 more model iterations. Keep in mind that your final model does not need to be perfect, you will be explaining its shortcomings in the final model summary.
#I ran the model several more times and here is the final model:
mod1_final = lm(adj_Vehicle_Theft ~ Population_Density +
Liquor_License +
sqrt(Vandalism)+
sqrt(Drug_Abuse),
data=chicago@data)
summary(mod1_final)
##
## Call:
## lm(formula = adj_Vehicle_Theft ~ Population_Density + Liquor_License +
## sqrt(Vandalism) + sqrt(Drug_Abuse), data = chicago@data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.5581 -0.7987 -0.0456 0.6983 4.2451
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.602e-01 1.591e-01 4.149 3.78e-05 ***
## Population_Density 2.181e-05 5.256e-06 4.150 3.76e-05 ***
## Liquor_License 8.654e-02 3.862e-02 2.241 0.0254 *
## sqrt(Vandalism) 4.929e-01 2.738e-02 17.999 < 2e-16 ***
## sqrt(Drug_Abuse) 1.070e-02 1.463e-02 0.731 0.4648
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.101 on 655 degrees of freedom
## Multiple R-squared: 0.5081, Adjusted R-squared: 0.5051
## F-statistic: 169.1 on 4 and 655 DF, p-value: < 2.2e-16
# First of all, there were some outliers in the residuals. I tried to eliminate them, but each removal exposed new ones in a seemingly endless loop. After eliminating Number_of_property_crimes I no longer had that problem, so it seems that some outliers in Number_of_property_crimes were affecting the model.
# We transformed Vandalism and Drug_Abuse using square roots, which significantly improved the model.
# This model has the smallest AIC (1952.287).
# However, the model is still heteroscedastic and its residuals are autocorrelated.
# I eliminated some of the variables due to collinearity.
# Vandalism had the highest partial eta-squared value, at 0.1973, showing it contributed most to the vehicle theft model; Drug_Abuse, on the other hand, has the smallest.
7-1- In what ways does your final model explain the crime type you chose? Pay special attention to the sign and significance of each variable. Do the predictors align with your intuition about the association between social/economic characteristics and the frequency of crime?
" To explain our crime type (number of Vehicle Theft, we came up with four variable that can explain the variation of our model better.
1) Population_Density: this variable is significant, its increase leads to increasing in the number of violent crime. 2) Liquor_License is a fine variable in our model and with a positive coefficient has a direct relationship with number of Vehicle Theft. This predictor also align with my intuition about the association between social/economic characteristics and the number of Vehicle_Theft. 3) vandalism was the most statistically significant variable of this model. This may be due to vehicle theft and vandalism as co-occurring events, as well as increasing escalation of crimes and/or opportunity for crime. 4) Finally, Drug_Abuse that although is not a significant variable but differnet tests proved that it's an effective variable in our model.
"
7-2- What shortcomings does your final model have? In what ways is the model’s explanatory power limited? Does the model still violate assumptions of OLS regression (i.e. heteroskedasticity, residual [spatial]autocorrelation)? (Explain how these problems limit your ability to draw inferences about the crime type you chose.) How might the characteristics of the variables you chose influence these problems? [*]
"
Our final model has an R-squared of 0.5081 (adjusted 0.5051), which is reasonable for a linear regression model, but the results are still not completely satisfying. First of all, there are still outliers in our residuals; deleting outliers turned out not to be a straightforward process. We also still have heteroscedasticity and autocorrelated residuals. After examining the plots of residuals against the predictors, the residuals for Vandalism do not bounce randomly around the zero line, which suggests that the linearity assumption is not reasonable for it. Yet when I plotted Vandalism against Vehicle_Theft directly, they clearly have a very strong linear relationship; I do not know the exact reason for this discrepancy.
"
1-1- How do the explanatory variables you chose reflect the crime type you are modeling?
"
The Type of Crime (Dependent Variable): Burglary
Variables of Interest (Independent Variable):
Weapons_Violation
Percent_Less_than_High_School_Education_
Number_of_violent_crimes
Number_of_property_crimes
Vandalism
Percent_of_children_in_single_female_headed_household
Disorderly_Conduct
For the number of burglaries, I started with a set of related variables. For example, it seems that an increase in any of these variables could increase burglary.
# I first ran each of the variables above in a univariate regression against the dependent variable.
plot(sqrt(Burglary) ~ (Weapons_Violation ), data = chicago@data)
violent_crimes <- lm(sqrt(Burglary) ~ (Weapons_Violation) , data = chicago@data)
summary(violent_crimes)
Multiple R-squared: 0.3537, Adjusted R-squared: 0.3526
F-statistic: 349.6 on 1 and 639 DF, p-value: < 2.2e-16
plot(sqrt(Burglary) ~ sqrt(Median_Rent_), data = chicago@data)
violent_crimes <- lm(sqrt(Burglary) ~ Median_Rent_, data = chicago@data)
summary(violent_crimes)
Multiple R-squared: 2.248e-05, Adjusted R-squared: -0.001552
F-statistic: 0.01427 on 1 and 635 DF, p-value: 0.9049
plot(sqrt(chicago@data$Burglary) , (chicago@data$Percent_Less_than_High_School_Education_))
violent_crimes <- lm(sqrt(Burglary) ~ (Percent_Less_than_High_School_Education_), data = chicago@data)
summary(violent_crimes)
Multiple R-squared: 0.0165, Adjusted R-squared: 0.01496
F-statistic: 10.72 on 1 and 639 DF, p-value: 0.001117
plot(sqrt(Burglary) ~ (Number_of_violent_crimes), data = chicago@data)
violent_crimes <- lm(sqrt(Burglary) ~ (Number_of_violent_crimes), data = chicago@data)
summary(violent_crimes)
Multiple R-squared: 0.4443, Adjusted R-squared: 0.4435
F-statistic: 511 on 1 and 639 DF, p-value: < 2.2e-16
plot(sqrt(Burglary) ~ (Number_of_property_crimes), data = chicago@data)
violent_crimes <- lm(sqrt(Burglary) ~ (Number_of_property_crimes), data = chicago@data)
summary(violent_crimes)
Multiple R-squared: 0.3026, Adjusted R-squared: 0.3015
F-statistic: 277.3 on 1 and 639 DF, p-value: < 2.2e-16
plot(sqrt(Burglary) ~ (Population_Density), data = chicago@data)
violent_crimes <- lm(sqrt(Burglary) ~ (Population_Density), data = chicago@data)
summary(violent_crimes)
Multiple R-squared: 0.01085, Adjusted R-squared: 0.009299
F-statistic: 7.007 on 1 and 639 DF, p-value: 0.008319
plot(sqrt(Burglary) ~ (Vandalism), data = chicago@data)
violent_crimes <- lm(sqrt(Burglary) ~ (Vandalism), data = chicago@data)
summary(violent_crimes)
Multiple R-squared: 0.6047, Adjusted R-squared: 0.6041
F-statistic: 977.4 on 1 and 639 DF, p-value: < 2.2e-16
plot(sqrt(Burglary) ~ sqrt(Percent_of_children_in_single_female_headed_household), data = chicago@data)
violent_crimes <- lm(sqrt(Burglary) ~ (Percent_of_children_in_single_female_headed_household), data = chicago@data)
summary(violent_crimes)
Multiple R-squared: 0.0744, Adjusted R-squared: 0.07295
F-statistic: 51.36 on 1 and 639 DF, p-value: 2.126e-12
plot(sqrt(Burglary) ~ sqrt(Percent_Employed_in_Service_Occupations), data = chicago@data)
violent_crimes <- lm(sqrt(Burglary) ~ sqrt(Percent_Employed_in_Service_Occupations), data = chicago@data)
summary(violent_crimes)
Multiple R-squared: 0.4178, Adjusted R-squared: 0.417
F-statistic: 472.3 on 1 and 658 DF, p-value: < 2.2e-16
plot(sqrt(Burglary) ~ (Disorderly_Conduct), data = chicago@data)
violent_crimes <- lm(sqrt(Burglary) ~ (Disorderly_Conduct ), data = chicago@data)
summary(violent_crimes)
Multiple R-squared: 0.1987, Adjusted R-squared: 0.1975
F-statistic: 158.5 on 1 and 639 DF, p-value: < 2.2e-16
Most of the variables are significant in these first one-to-one regressions. I eliminated Median_Rent_, Population_Density, and Percent_Employed_in_Service_Occupations because I did not find a strong linear relationship between them and Burglary.
"
2-1- Summarize the data and/or provide a few visualizations (maps, plots). Map the dependent variable (a type of crime)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 13.00 22.00 27.23 36.00 144.00
2-2- What general trends do you observe?
"Based on the map of number of Burglary, we can see that this crime is not random and is clustered. Box plot shows that there are some outliers in the data."
3-1- Did you combine any fields, create dummy variables, transform variables, subset the data, etc?
# The box plot shows the data are not normal and contain outliers; a square-root transform is used to bring the distribution closer to normal
# Shapiro-Wilk normality test
shapiro.test(chicago@data$Burglary)
##
## Shapiro-Wilk normality test
##
## data: chicago@data$Burglary
## W = 0.85655, p-value < 2.2e-16
shapiro.test(sqrt(chicago@data$Burglary)) # the transformed variable is closer to normal
##
## Shapiro-Wilk normality test
##
## data: sqrt(chicago@data$Burglary)
## W = 0.98185, p-value = 2.683e-07
#Create a new variable called adj_Burglary that ‘adjusts’ for potential outliers.
chicago$adj_Burglary = sqrt(chicago$Burglary)
# Mapping the transformed data shows a more random spatial pattern.
4-1- LM summary
summary(mod1)
##
## Call:
## lm(formula = adj_Burglary ~ Weapons_Violation + Percent_Less_than_High_School_Education_ +
## Number_of_violent_crimes + Number_of_property_crimes + Vandalism +
## Percent_of_children_in_single_female_headed_household + Disorderly_Conduct,
## data = chicago@data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.2546 -0.7072 -0.0148 0.6884 3.8995
##
## Coefficients:
## Estimate
## (Intercept) 2.7427623
## Weapons_Violation 0.0333969
## Percent_Less_than_High_School_Education_ -0.0183536
## Number_of_violent_crimes 0.0042436
## Number_of_property_crimes 0.0009524
## Vandalism 0.0458025
## Percent_of_children_in_single_female_headed_household -0.0016928
## Disorderly_Conduct -0.0135845
## Std. Error t value
## (Intercept) 0.1180676 23.230
## Weapons_Violation 0.0144864 2.305
## Percent_Less_than_High_School_Education_ 0.0064658 -2.839
## Number_of_violent_crimes 0.0039678 1.070
## Number_of_property_crimes 0.0005315 1.792
## Vandalism 0.0032790 13.969
## Percent_of_children_in_single_female_headed_household 0.0025265 -0.670
## Disorderly_Conduct 0.0109276 -1.243
## Pr(>|t|)
## (Intercept) < 2e-16 ***
## Weapons_Violation 0.02146 *
## Percent_Less_than_High_School_Education_ 0.00467 **
## Number_of_violent_crimes 0.28524
## Number_of_property_crimes 0.07363 .
## Vandalism < 2e-16 ***
## Percent_of_children_in_single_female_headed_household 0.50308
## Disorderly_Conduct 0.21426
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.178 on 652 degrees of freedom
## Multiple R-squared: 0.6241, Adjusted R-squared: 0.62
## F-statistic: 154.6 on 7 and 652 DF, p-value: < 2.2e-16
4-2- Interpret model coefficients/p-values, model rˆ2/F-Test
"Most of the variables are significant, except, Number_of_violent_crimes, Percent_of_children_in_single_female_headed_household, and Disorderly_Conduct.
The R-square is fine (0.6204) that means our dependent variales explains msot of the variability of the adj_Burglary in the the regression model.
The p-value of our model is very small and it means that there is a significant relationship between the variables in the regression model.
Percent_of_children_in_single_female_headed_household and Percent_Less_than_High_School_Education_ coefficients seems sucpitious, because their increase lead to decrease in the violent crime
"
4-3- Model Diagnostics
4-3-1- Residual plots
plot(mod1)
hist(resid(mod1))
plot(resid(mod1) ~ predict(mod1))
abline(0, 0) # the horizon
4-3-1-1- Heteroskedasticity (non-constant variance)
# Evaluate heteroscedasticity
# If the test is positive (low p value), you should see if any transformation of the dependent variable helps you eliminate heteroscedasticity
library(zoo)
library(lmtest)
bptest(mod1)
##
## studentized Breusch-Pagan test
##
## data: mod1
## BP = 42.447, df = 7, p-value = 4.265e-07
" There is heteroscedasticity
"
4-3-1-2- Influential observations (possible outliers)
# Bonferroni p-values to assess outliers
library(car)
outlierTest(mod1) # Bonferonni p-value for most extreme obs
##
## No Studentized residuals with Bonferonni p < 0.05
## Largest |rstudent|:
## rstudent unadjusted p-value Bonferonni p
## 82 -3.707766 0.00022692 0.14976
" There are 1 outliers based on Bonferroni test (82). But 520, 235, 420, 1nd 433 also is apparantly outliers based on the regression plots.
"
4-3-1-3- pattern (Nonlinearity)
# When the residuals bounce randomly around the 0 line, it suggests that the assumption that the relationship is linear is reasonable
# Evaluate Nonlinearity
# component + residual plot
crPlots(mod1)
" Vandalism are not linear.
"
4-3-1-4- Normality of Residuals
qqPlot(mod1, main="QQ Plot") #qq plot for studentized resid
# distribution of studentized residuals
library(MASS)
sresid <- studres(mod1)
hist(sresid, freq=FALSE,
main="Distribution of Studentized Residuals")
xfit<-seq(min(sresid),max(sresid),length=40)
yfit<-dnorm(xfit)
lines(xfit, yfit)
#Null hypothesis residuals are normally distributed
shapiro.test(resid(mod1))
##
## Shapiro-Wilk normality test
##
## data: resid(mod1)
## W = 0.99287, p-value = 0.003085
" Residuals are nearly normal. It seems there are some outliers.
"
4-3-1-5- Non-independence of Errors
# Test for Autocorrelated Errors (verifies if the residuals from a linear model are correlated or not)
library(car)
durbinWatsonTest(mod1)
## lag Autocorrelation D-W Statistic p-value
## 1 0.2451894 1.508794 0
## Alternative hypothesis: rho != 0
# The null hypothesis (H0) is that there is no correlation among residuals, i.e., they are independent.
# The alternative hypothesis (Ha) is that residuals are autocorrelated.
" The test rejects independence: residuals are autocorrelated.
"
4-3-2- map residuals
" There is not a clear spatial pattern in the residual map.
"
4-3-3- check for multicollinearity (VIF) (Note: the cutoff is 2.5)
# Evaluate Collinearity
library(car)
vif(mod1) # variance inflation factors
## Weapons_Violation
## 3.960877
## Percent_Less_than_High_School_Education_
## 1.721708
## Number_of_violent_crimes
## 7.485356
## Number_of_property_crimes
## 1.816553
## Vandalism
## 3.935957
## Percent_of_children_in_single_female_headed_household
## 2.244981
## Disorderly_Conduct
## 2.125924
"
The cutoff is 2.5, and three variables exceed it: Weapons_Violation (3.96), Number_of_violent_crimes (7.49), and Vandalism (3.94). Number_of_violent_crimes, with by far the highest VIF, is the first candidate for elimination.
"
4-3-4- examine the partial R-Square for each variable
library(lmSupport)
modelEffectSizes(mod1)
## lm(formula = adj_Burglary ~ Weapons_Violation + Percent_Less_than_High_School_Education_ +
## Number_of_violent_crimes + Number_of_property_crimes + Vandalism +
## Percent_of_children_in_single_female_headed_household + Disorderly_Conduct,
## data = chicago@data)
##
## Coefficients
## SSR df pEta-sqr
## (Intercept) 749.0066 1 0.4529
## Weapons_Violation 7.3767 1 0.0081
## Percent_Less_than_High_School_Education_ 11.1832 1 0.0122
## Number_of_violent_crimes 1.5876 1 0.0018
## Number_of_property_crimes 4.4559 1 0.0049
## Vandalism 270.8136 1 0.2303
## Percent_of_children_in_single_female_headed_household 0.6231 1 0.0007
## Disorderly_Conduct 2.1449 1 0.0024
## dR-sqr
## (Intercept) NA
## Weapons_Violation 0.0031
## Percent_Less_than_High_School_Education_ 0.0046
## Number_of_violent_crimes 0.0007
## Number_of_property_crimes 0.0019
## Vandalism 0.1125
## Percent_of_children_in_single_female_headed_household 0.0003
## Disorderly_Conduct 0.0009
##
## Sum of squared errors (SSE): 904.9
## Sum of squared total (SST): 2407.1
"
From the table generated by this function, the pEta-sqr is highest for Vandalism, at 0.2303, and lowest for Percent_of_children_in_single_female_headed_household, at 0.0007. These pEta-sqr values reflect the marginal contribution of each variable to the model, i.e., its impact given that all other variables
are already in the model.
"
5-1- What variables did you add/remove?
"A VIF test was conducted on the model to check variable multicolinearity. Results indicated that 2 variables may have correlations with one another. To compensate, 1 of the them were omitted:,
I emiminated Number_of_violent_crimes.
"
5-2- check for multicollinearity (VIF)
A VIF on the reduced model is provided below. Removing this variable improved the VIFs:
## Weapons_Violation
## 3.265078
## Percent_Less_than_High_School_Education_
## 1.697362
## Number_of_property_crimes
## 1.557522
## Vandalism
## 3.024045
## Percent_of_children_in_single_female_headed_household
## 2.070943
## Disorderly_Conduct
## 1.962396
"now Weapons_Violation and Vandalism are colinear.
"
5-3- Diagnostic tests: AIC, Anova, extra sum of squares (modelEffectSizes, partial R-Squared), etc.
# Akaike Information Criterion (AIC)
# Small values are desirable
AIC(mod1)
## [1] 2099.312
AIC(mod1_red1)
## [1] 2098.469
# AIC decreased slightly after dropping Number_of_violent_crimes
# Anova: tests whether the reduction in the residual sum of squares is statistically significant
#anova(mod1, mod1_red1)
# anova() errors out here, most likely because dropping Number_of_violent_crimes changed which rows are complete cases, so the two models were fitted to different numbers of observations.
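A likely reason anova(mod1, mod1_red1) fails is missing data: dropping a predictor changes which rows are complete cases, and anova() refuses to compare models fitted to different numbers of observations. A self-contained sketch of the problem and the fix on synthetic data (complete.cases() restricts both fits to the shared rows):

```r
# Reproduce the anova() same-dataset problem and fix it.
set.seed(3)
d <- data.frame(y = rnorm(60), a = rnorm(60), b = rnorm(60))
d$b[1:6] <- NA                        # missing values in one predictor
full <- lm(y ~ a + b, data = d)      # silently fit on 54 complete rows
red  <- lm(y ~ a, data = d)          # fit on all 60 rows
# anova(red, full) would fail: the models were not fitted to the same rows.
cc    <- complete.cases(d)           # refit both on the shared rows
full2 <- lm(y ~ a + b, data = d[cc, ])
red2  <- lm(y ~ a, data = d[cc, ])
anova(red2, full2)                   # valid extra-sum-of-squares F-test
```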
# extra sum of squares (modelEffectSizes, partial R-Squared)
modelEffectSizes(mod1_red1)
## lm(formula = adj_Burglary ~ Weapons_Violation + Percent_Less_than_High_School_Education_ +
## Number_of_property_crimes + Vandalism + Percent_of_children_in_single_female_headed_household +
## Disorderly_Conduct, data = chicago@data)
##
## Coefficients
## SSR df pEta-sqr
## (Intercept) 818.1543 1 0.4744
## Weapons_Violation 12.7669 1 0.0139
## Percent_Less_than_High_School_Education_ 10.3499 1 0.0113
## Number_of_property_crimes 7.8037 1 0.0085
## Vandalism 378.9378 1 0.2948
## Percent_of_children_in_single_female_headed_household 0.2085 1 0.0002
## Disorderly_Conduct 1.3471 1 0.0015
## dR-sqr
## (Intercept) NA
## Weapons_Violation 0.0053
## Percent_Less_than_High_School_Education_ 0.0043
## Number_of_property_crimes 0.0032
## Vandalism 0.1574
## Percent_of_children_in_single_female_headed_household 0.0001
## Disorderly_Conduct 0.0006
##
## Sum of squared errors (SSE): 906.5
## Sum of squared total (SST): 2407.1
# Vandalism had the highest pEta-sqr (0.2948), showing it contributed most to the burglary model; Percent_of_children_in_single_female_headed_household has the smallest (0.0002).
6-1- Follow the same workflow as the “1st Model Specification”/“Model adjustment” sections. You should aim to run through ˜2 more model iterations. Keep in mind that your final model does not need to be perfect, you will be explaining its shortcomings in the final model summary.
#I ran the model several more times and here is the final model:
mod1_final = lm(adj_Burglary ~
Percent_Less_than_High_School_Education_ +
Weapons_Violation +
sqrt(Vandalism),
data=chicago@data)
summary(mod1_final)
##
## Call:
## lm(formula = adj_Burglary ~ Percent_Less_than_High_School_Education_ +
## Weapons_Violation + sqrt(Vandalism), data = chicago@data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.9616 -0.7339 -0.0751 0.6287 4.0368
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) 0.766300 0.162921 4.704
## Percent_Less_than_High_School_Education_ -0.022259 0.005092 -4.371
## Weapons_Violation 0.043915 0.009984 4.398
## sqrt(Vandalism) 0.654699 0.028475 22.992
## Pr(>|t|)
## (Intercept) 3.12e-06 ***
## Percent_Less_than_High_School_Education_ 1.44e-05 ***
## Weapons_Violation 1.27e-05 ***
## sqrt(Vandalism) < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.129 on 656 degrees of freedom
## Multiple R-squared: 0.6524, Adjusted R-squared: 0.6508
## F-statistic: 410.4 on 3 and 656 DF, p-value: < 2.2e-16
# First, there were some outliers in the residuals. I tried to eliminate them, but as with the previous model the process became an endless loop: each refit exposed new extreme residuals. After eliminating Number_of_property_crimes the problem went away, so it seems some outliers in Number_of_property_crimes were affecting the model.
# I eliminated Percent_of_children_in_single_female_headed_household and Disorderly_Conduct because they were not significant.
# Square-root transforming Vandalism significantly improved the model.
# This model has the smallest AIC so far (1984).
# I eliminated Number_of_violent_crimes due to collinearity.
# This model still shows heteroscedasticity and autocorrelated residuals, but there are no longer significant outliers.
# I removed Weapons_Violation once due to collinearity, but after the other changes I added it back, and the model no longer shows collinearity.
# Vandalism had the highest partial sum of squares, contributing most to the burglary model; Percent_Less_than_High_School_Education_ has the smallest.
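The "endless loop" of outlier removal, where each refit exposes new extreme residuals, can be tamed by capping the number of passes. A sketch on synthetic data with a few planted outliers (the |rstudent| > 3 threshold and the three-pass cap are judgment calls for illustration, not part of the original workflow):

```r
# Iterative outlier trimming with a hard cap on the number of refits.
set.seed(4)
d <- data.frame(x = rnorm(100))
d$y <- 1 + 2 * d$x + rnorm(100)
d$y[1:5] <- d$y[1:5] + 10             # plant five gross outliers
for (i in 1:3) {                      # hard cap: at most three passes
  fit <- lm(y ~ x, data = d)
  bad <- which(abs(rstudent(fit)) > 3)
  if (length(bad) == 0) break         # stop once nothing extreme remains
  d <- d[-bad, ]
}
nrow(d)                               # rows kept after trimming
```

The cap guarantees termination; if points are still being flagged at the cap, that is a signal to rethink the model (or use robust regression) rather than keep deleting data.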
7-1- In what ways does your final model explain the crime type you chose? Pay special attention to the sign and significance of each variable. Do the predictors align with your intuition about the association between social/economic characteristics and the frequency of crime?
" To explain our crime type (number of Burglary here, we came up with three variable that can explain the variation of our model better.
1) Percent_Less_than_High_School_Education_: this variable is significant, and its increase leads to deacreasing in the number of violent crime. This predictor does not align with my intuition about the association between social/economic characteristics and the number of Burglary. 2) Weapons_Violation is a significant variable too and with a positive coefficient has a direct relationship with number of Vehicle Theft. 3) vandalism was the most statistically significant variable of this model. This may be due to Burglary and vandalism as co-occurring events, as well as increasing escalation of crimes and/or opportunity for crime.
"
7-2- What shortcomings does your final model have? In what ways is the model’s explanatory power limited? Does the model still violate assumptions of OLS regression (i.e. heteroskedasticity, residual [spatial]autocorrelation)? (Explain how these problems limit your ability to draw inferences about the crime type you chose.) How might the characteristics of the variables you chose influence these problems? [*]
"
Our final model has an R-squared of 0.6524, which is reasonable for a linear regression model, but it still violates some assumptions of OLS regression: heteroscedasticity and correlated residuals remain. Examining the plots of residuals versus each variable, the residuals for Vandalism do not bounce randomly around the 0 line, which suggests the linearity assumption is not reasonable. Yet plotting Vandalism against Burglary directly shows a clear, strong linear relationship. Having used Vandalism in two models now, I suspect there is something about this variable that prevents us from eliminating the heteroscedasticity and correlated residuals.
"
1-1- How do the explanatory variables you chose reflect the crime type you are modeling?
"
The Type of Crime (Dependent Variable): Weapons_Violation
Variables of Interest (Independent Variable):
Number_of_violent_crimes
Number_of_property_crimes
Homicides
Percent_of_children_in_single_female_headed_household
PCT_of_Households_with_cash_public_assistance
Disorderly_Conduct
For the number of Weapons_Violation, I started with a set of related variables; it seems plausible that an increase in any of these variables is associated with an increase in Weapons_Violation.
# I first ran each of these variables in a univariate regression against the dependent variable.
plot(sqrt(Weapons_Violation) ~ (Number_of_violent_crimes), data = chicago@data)
violent_crimes <- lm(sqrt(Weapons_Violation) ~ (Number_of_violent_crimes) , data = chicago@data)
summary(violent_crimes)
Multiple R-squared: 0.6692, Adjusted R-squared: 0.6686
F-statistic: 1288 on 1 and 637 DF, p-value: < 2.2e-16
plot(sqrt(Weapons_Violation) ~ (Number_of_property_crimes), data = chicago@data)
violent_crimes <- lm(sqrt(Weapons_Violation) ~ (Number_of_property_crimes), data = chicago@data)
summary(violent_crimes)
Multiple R-squared: 0.1621, Adjusted R-squared: 0.1608
F-statistic: 123.2 on 1 and 637 DF, p-value: < 2.2e-16
plot(sqrt(Weapons_Violation) ~ (Homicides), data = chicago@data)
violent_crimes <- lm(sqrt(Weapons_Violation) ~ (Homicides), data = chicago@data)
summary(violent_crimes)
Multiple R-squared: 0.2517, Adjusted R-squared: 0.2506
F-statistic: 214.3 on 1 and 637 DF, p-value: < 2.2e-16
plot(sqrt(Weapons_Violation) ~ (Percent_of_children_in_single_female_headed_household), data = chicago@data)
violent_crimes <- lm(sqrt(Weapons_Violation) ~ (Percent_of_children_in_single_female_headed_household), data = chicago@data)
summary(violent_crimes)
Multiple R-squared: 0.361, Adjusted R-squared: 0.36
F-statistic: 359.8 on 1 and 637 DF, p-value: < 2.2e-16
plot(sqrt(Weapons_Violation) ~ (PCT_of_Households_with_cash_public_assistance), data = chicago@data)
violent_crimes <- lm(sqrt(Weapons_Violation) ~ (PCT_of_Households_with_cash_public_assistance), data = chicago@data)
summary(violent_crimes)
Multiple R-squared: 0.391, Adjusted R-squared: 0.39
F-statistic: 408.9 on 1 and 637 DF, p-value: < 2.2e-16
plot(sqrt(Weapons_Violation) ~ (Disorderly_Conduct), data = chicago@data)
violent_crimes <- lm(sqrt(Weapons_Violation) ~ (Disorderly_Conduct), data = chicago@data)
summary(violent_crimes)
Multiple R-squared: 0.4476, Adjusted R-squared: 0.4467
F-statistic: 516.1 on 1 and 637 DF, p-value: < 2.2e-16
Most of the variables are significant in these first one-to-one regressions. I eliminated Percent_Hispanic_ because I did not find a linear relationship between it and Weapons_Violation.
"
2-1- Summarize the data and/or provide a few visualizations (maps, plots). Map the dependent variable (a type of crime)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 1.000 3.000 5.323 8.000 43.000
2-2- What general trends do you observe?
"Based on the map of number of Weapons_Violation, we can see that this crime is not random and is clustered. Box plot shows that there are some outliers in the data."
3-1- Did you combine any fields, create dummy variables, transform variables, subset the data, etc?
# The box plot shows the data are not normal and contain outliers; a square-root transform is used to bring the distribution closer to normal
# Shapiro-Wilk normality test
shapiro.test(chicago@data$Weapons_Violation)
##
## Shapiro-Wilk normality test
##
## data: chicago@data$Weapons_Violation
## W = 0.78814, p-value < 2.2e-16
shapiro.test(sqrt(chicago@data$Weapons_Violation)) # the transformed variable is closer to normal
##
## Shapiro-Wilk normality test
##
## data: sqrt(chicago@data$Weapons_Violation)
## W = 0.94843, p-value = 2.105e-14
#Create a new variable called adj_Weapons_Violation that ‘adjusts’ for potential outliers.
chicago$adj_Weapons_Violation = sqrt(chicago$Weapons_Violation)
# Mapping the transformed data shows a more random spatial pattern.
4-1- LM summary
summary(mod1)
##
## Call:
## lm(formula = adj_Weapons_Violation ~ Number_of_violent_crimes +
## Number_of_property_crimes + Homicides + Percent_of_children_in_single_female_headed_household +
## PCT_of_Households_with_cash_public_assistance + Disorderly_Conduct,
## data = chicago@data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.04318 -0.52452 0.06791 0.48118 2.30027
##
## Coefficients:
## Estimate
## (Intercept) 0.3476733
## Number_of_violent_crimes 0.0246202
## Number_of_property_crimes -0.0008810
## Homicides 0.0555848
## Percent_of_children_in_single_female_headed_household 0.0050621
## PCT_of_Households_with_cash_public_assistance 0.0120444
## Disorderly_Conduct 0.0397978
## Std. Error t value
## (Intercept) 0.0621063 5.598
## Number_of_violent_crimes 0.0018576 13.254
## Number_of_property_crimes 0.0003224 -2.733
## Homicides 0.0332075 1.674
## Percent_of_children_in_single_female_headed_household 0.0015696 3.225
## PCT_of_Households_with_cash_public_assistance 0.0029079 4.142
## Disorderly_Conduct 0.0064627 6.158
## Pr(>|t|)
## (Intercept) 3.19e-08 ***
## Number_of_violent_crimes < 2e-16 ***
## Number_of_property_crimes 0.00645 **
## Homicides 0.09464 .
## Percent_of_children_in_single_female_headed_household 0.00132 **
## PCT_of_Households_with_cash_public_assistance 3.90e-05 ***
## Disorderly_Conduct 1.29e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7151 on 653 degrees of freedom
## Multiple R-squared: 0.7261, Adjusted R-squared: 0.7235
## F-statistic: 288.4 on 6 and 653 DF, p-value: < 2.2e-16
4-2- Interpret model coefficients/p-values, model rˆ2/F-Test
"Most of the variables are significant, except, Number_of_violent_crimes, Percent_of_children_in_single_female_headed_household, and Disorderly_Conduct.
The R-square is fine (0.6204) that means our dependent variales explains msot of the variability of the adj_Weapons_Violation in the the regression model.
The p-value of our model is very small and it means that there is a significant relationship between the variables in the regression model.
Percent_of_children_in_single_female_headed_household and Percent_Less_than_High_School_Education_ coefficients seems sucpitious, because their increase lead to decrease in the violent crime
"
4-3- Model Diagnostics
4-3-1- Residual plots
plot(mod1)
hist(resid(mod1))
plot(resid(mod1) ~ predict(mod1))
abline(0, 0) # the horizon
4-3-1-1- Heteroskedasticity (non-constant variance)
# Evaluate heteroscedasticity
# If the test is positive (low p value), you should see if any transformation of the dependent variable helps you eliminate heteroscedasticity
library(zoo)
library(lmtest)
bptest(mod1)
##
## studentized Breusch-Pagan test
##
## data: mod1
## BP = 14.793, df = 6, p-value = 0.02193
" There is not heteroscedasticity
"
4-3-1-2- Influential observations (possible outliers)
# Bonferroni p-values to assess outliers
library(car)
outlierTest(mod1) # Bonferonni p-value for most extreme obs
##
## No Studentized residuals with Bonferonni p < 0.05
## Largest |rstudent|:
## rstudent unadjusted p-value Bonferonni p
## 601 3.276877 0.0011053 0.7295
" There are 1 outliers based on Bonferroni test (601).
"
4-3-1-3- pattern (Nonlinearity)
# When the residuals bounce randomly around the 0 line, it suggests that the assumption that the relationship is linear is reasonable
# Evaluate Nonlinearity
# component + residual plot
crPlots(mod1)
## Warning in smoother(.x, partial.res[, var], col = col.lines[2], log.x =
## FALSE, : could not fit smooth
" Number_of_violent_crimes are not linear.
"
4-3-1-4- Normality of Residuals
qqPlot(mod1, main="QQ Plot") #qq plot for studentized resid
# distribution of studentized residuals
library(MASS)
sresid <- studres(mod1)
hist(sresid, freq=FALSE,
main="Distribution of Studentized Residuals")
xfit<-seq(min(sresid),max(sresid),length=40)
yfit<-dnorm(xfit)
lines(xfit, yfit)
#Null hypothesis residuals are normally distributed
shapiro.test(resid(mod1))
##
## Shapiro-Wilk normality test
##
## data: resid(mod1)
## W = 0.9952, p-value = 0.0377
" Residuals are nearly normal. It seems there are a few outliers.
"
4-3-1-5- Non-independence of Errors
# Test for Autocorrelated Errors (verifies if the residuals from a linear model are correlated or not)
library(car)
durbinWatsonTest(mod1)
## lag Autocorrelation D-W Statistic p-value
## 1 0.0954171 1.8064 0.012
## Alternative hypothesis: rho != 0
# The null hypothesis (H0) is that there is no correlation among residuals, i.e., they are independent.
# The alternative hypothesis (Ha) is that residuals are autocorrelated.
" The Durbin-Watson statistic (1.81) is close to 2, but the p-value (0.012) still indicates mild residual autocorrelation.
"
4-3-2- map residuals
" There is not a clear spatial pattern in the residual map.
"
4-3-3- check for multicollinearity (VIF) (Note: the cutoff is 2.5)
# Evaluate Collinearity
library(car)
vif(mod1) # variance inflation factors
## Number_of_violent_crimes
## 4.453008
## Number_of_property_crimes
## 1.813820
## Homicides
## 1.494185
## Percent_of_children_in_single_female_headed_household
## 2.351556
## PCT_of_Households_with_cash_public_assistance
## 2.529481
## Disorderly_Conduct
## 2.018146
"
The cutoff is 2.5, and one variable (PCT_of_Households_with_cash_public_assistance, at 2.53) barely exceeds it.
"
4-3-4- examine the partial R-Square for each variable
library(lmSupport)
modelEffectSizes(mod1)
## lm(formula = adj_Weapons_Violation ~ Number_of_violent_crimes +
## Number_of_property_crimes + Homicides + Percent_of_children_in_single_female_headed_household +
## PCT_of_Households_with_cash_public_assistance + Disorderly_Conduct,
## data = chicago@data)
##
## Coefficients
## SSR df pEta-sqr
## (Intercept) 16.0255 1 0.0458
## Number_of_violent_crimes 89.8260 1 0.2120
## Number_of_property_crimes 3.8186 1 0.0113
## Homicides 1.4328 1 0.0043
## Percent_of_children_in_single_female_headed_household 5.3191 1 0.0157
## PCT_of_Households_with_cash_public_assistance 8.7732 1 0.0256
## Disorderly_Conduct 19.3925 1 0.0549
## dR-sqr
## (Intercept) NA
## Number_of_violent_crimes 0.0737
## Number_of_property_crimes 0.0031
## Homicides 0.0012
## Percent_of_children_in_single_female_headed_household 0.0044
## PCT_of_Households_with_cash_public_assistance 0.0072
## Disorderly_Conduct 0.0159
##
## Sum of squared errors (SSE): 333.9
## Sum of squared total (SST): 1218.9
"
From the table generated by this function, the pEta-sqr is highest for Number_of_violent_crimes, at 0.2120, and lowest for Homicides, at 0.0043. These pEta-sqr values reflect the marginal contribution of each variable to the model, i.e., its impact given that all other variables
are already in the model.
"
5-1- What variables did you add/remove?
"This model has been the best model so far. We did not have any colinearity. I eliminated Number_of_property_crimes and Homicides because they were not significant. Also I transform the remaining variables using SQRT.
"
5-2- check for multicollinearity (VIF)
A VIF on the reduced model is provided below. Removing and transforming variables improved the VIFs:
## sqrt(Number_of_violent_crimes)
## 2.772351
## sqrt(Percent_of_children_in_single_female_headed_household)
## 2.311293
## sqrt(PCT_of_Households_with_cash_public_assistance)
## 2.504350
## sqrt(Disorderly_Conduct)
## 2.284723
"now Weapons_Violation and Vandalism are colinear.
"
5-3- Diagnostic tests: AIC, Anova, extra sum of squares (modelEffectSizes, partial R-Squared), etc.
# Akaike Information Criterion (AIC)
# Small values are desirable
AIC(mod1)
## [1] 1439.332
AIC(mod1_red1)
## [1] 1416.573
# AIC decreased
# Anova: tests whether the reduction in the residual sum of squares is statistically significant
anova(mod1, mod1_red1)
## Analysis of Variance Table
##
## Model 1: adj_Weapons_Violation ~ Number_of_violent_crimes + Number_of_property_crimes +
## Homicides + Percent_of_children_in_single_female_headed_household +
## PCT_of_Households_with_cash_public_assistance + Disorderly_Conduct
## Model 2: adj_Weapons_Violation ~ sqrt(Number_of_violent_crimes) + sqrt(Percent_of_children_in_single_female_headed_household) +
## sqrt(PCT_of_Households_with_cash_public_assistance) + sqrt(Disorderly_Conduct)
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 653 333.93
## 2 655 324.57 -2 9.3576
# No F-test is printed above: the full model is listed first (hence the negative Df), and the two models are not strictly nested anyway, since the reduced model square-root transforms its predictors. Notably, the reduced model has the lower RSS (324.6 vs 333.9) despite fewer parameters, so the AIC comparison above is the more meaningful one.
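The empty F and Pr(>F) columns above are partly an ordering artifact: anova() prints the test when the smaller model is listed first, while here the full model came first (hence the negative Df). A minimal synthetic sketch of the correct ordering; note that with the square-root-transformed predictors the two real models are not strictly nested, so the F-test should be read cautiously even when it is printed:

```r
# anova() with the reduced model first yields the usual extra-SS F-test.
set.seed(6)
d <- data.frame(y = rnorm(80), a = rnorm(80), b = rnorm(80))
full <- lm(y ~ a + b, data = d)
red  <- lm(y ~ a, data = d)
anova(red, full)   # reduced model first: Df is +1 and an F/p-value are printed
```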
# extra sum of squares (modelEffectSizes, partial R-Squared)
modelEffectSizes(mod1_red1)
## lm(formula = adj_Weapons_Violation ~ sqrt(Number_of_violent_crimes) +
## sqrt(Percent_of_children_in_single_female_headed_household) +
## sqrt(PCT_of_Households_with_cash_public_assistance) + sqrt(Disorderly_Conduct),
## data = chicago@data)
##
## Coefficients
## SSR df
## (Intercept) 62.7149 1
## sqrt(Number_of_violent_crimes) 126.5043 1
## sqrt(Percent_of_children_in_single_female_headed_household) 6.6564 1
## sqrt(PCT_of_Households_with_cash_public_assistance) 6.3232 1
## sqrt(Disorderly_Conduct) 23.4171 1
## pEta-sqr
## (Intercept) 0.1619
## sqrt(Number_of_violent_crimes) 0.2805
## sqrt(Percent_of_children_in_single_female_headed_household) 0.0201
## sqrt(PCT_of_Households_with_cash_public_assistance) 0.0191
## sqrt(Disorderly_Conduct) 0.0673
## dR-sqr
## (Intercept) NA
## sqrt(Number_of_violent_crimes) 0.1038
## sqrt(Percent_of_children_in_single_female_headed_household) 0.0055
## sqrt(PCT_of_Households_with_cash_public_assistance) 0.0052
## sqrt(Disorderly_Conduct) 0.0192
##
## Sum of squared errors (SSE): 324.6
## Sum of squared total (SST): 1218.9
# sqrt(Number_of_violent_crimes) had the highest pEta-sqr (0.2805), contributing most to the weapons-violation model; sqrt(PCT_of_Households_with_cash_public_assistance) has the smallest (0.0191).
6-1- Follow the same workflow as the “1st Model Specification”/“Model adjustment” sections. You should aim to run through ˜2 more model iterations. Keep in mind that your final model does not need to be perfect, you will be explaining its shortcomings in the final model summary.
# This model worked well on the very first try, and I do not see any issue that would require rerunning it. The R-squared is 0.7355, which is very good. There is no serious collinearity, residual autocorrelation, or heteroscedasticity in this model, only a few outlier residuals, which is normal.
7-1- In what ways does your final model explain the crime type you chose? Pay special attention to the sign and significance of each variable. Do the predictors align with your intuition about the association between social/economic characteristics and the frequency of crime?
" To explain our chosen crime type (the number of Weapons_Violation incidents), we arrived at four variables that together explain the variation in the dependent variable well. All of the variables in this model are statistically significant, and the predictors do align with my intuition about the association between social/economic characteristics and the number of Weapons_Violation incidents.
"
7-2- What shortcomings does your final model have? In what ways is the model’s explanatory power limited? Does the model still violate assumptions of OLS regression (i.e. heteroskedasticity, residual [spatial]autocorrelation)? (Explain how these problems limit your ability to draw inferences about the crime type you chose.) How might the characteristics of the variables you chose influence these problems? [*]
"
Our final model has an R-squared of 0.7355, which is high for a linear regression model of this kind. Unlike the earlier models, this one does not appear to violate the core assumptions of OLS regression: I found no evidence of multicollinearity, heteroscedasticity, or residual spatial autocorrelation. Its main shortcomings are a handful of outlier residuals and the usual caveat that, with correlated neighborhood-level predictors, individual coefficient magnitudes should not be over-interpreted.
"
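Question 7-2 asks about residual spatial autocorrelation; the lab checks this in R (typically via a Moran's I test on the model residuals). As an illustration of what that statistic actually computes, here is a minimal plain-Python sketch on a toy example; the residuals and the contiguity weights below are made up for demonstration only.

```python
# Illustrative sketch (Python; the lab itself uses R): Moran's I measures
# whether neighboring units have similar residuals.
def morans_i(resid, weights):
    """resid: list of residuals; weights: dict mapping (i, j) -> w_ij."""
    n = len(resid)
    mean = sum(resid) / n
    dev = [r - mean for r in resid]          # deviations from the mean
    w_sum = sum(weights.values())            # total weight
    num = sum(w * dev[i] * dev[j] for (i, j), w in weights.items())
    den = sum(d * d for d in dev)
    return (n / w_sum) * (num / den)

# Toy example: 4 tracts in a row, rook contiguity, symmetric weights.
resid = [1.0, 0.8, -0.9, -1.1]
w = {(0, 1): 1, (1, 0): 1, (1, 2): 1, (2, 1): 1, (2, 3): 1, (3, 2): 1}
print(round(morans_i(resid, w), 3))  # → 0.388 (positive: similar residuals cluster)
```

A Moran's I near zero on the final model's residuals is what supports the claim that spatial autocorrelation is not a problem; a clearly positive value (as in this toy example) would indicate clustered residuals and a violated OLS assumption.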
"
Given the many models run overall, several patterns emerged regarding the predictors associated with crime. In general, I did not find race variables (especially percent African American) to be explicit and significant indicators of crime incidents in Chicago, in contrast to variables such as Drug Abuse and Vandalism. I only saw that as crime increases, percent Hispanic decreases, which may be due to institutional racism. It seems that poverty plays a larger role in crime than race does. Across our dependent variables, crime clusters in the South and Northwestern portions of the city. I used crime variables as independent variables to investigate whether certain crimes routinely co-occur with the crime type being modeled. In modeling our dependent variables (violent crime, vehicle theft, burglary, and weapons violations), we see a strong spatial relationship in the co-occurrence of crimes such as vandalism, disorderly conduct, and property theft. For example, Burglary and Vandalism, or Vehicle Theft and Drug Abuse, share a spatial pattern; perhaps locations with high crime counts also have greater crime diversity. It therefore seems reasonable to expect police to use data to inform their strategies and tactics in some tracts, given the co-occurrence of crimes, but it seems unfair to target individuals because of racial characteristics. It would be interesting to investigate the reasons for the co-occurrence of some crimes.
"